Copyright © 2019 SUSE LLC
Copyright © 2016, Red Hat, Inc. and contributors.
The text of and illustrations in this document are licensed under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. All other trademarks are the property of their respective owners.
For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.
SUSE Enterprise Storage 6 is an extension to SUSE Linux Enterprise Server 15 SP1. It combines the capabilities of the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage 6 provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.
This guide helps you understand the concepts of SUSE Enterprise Storage 6, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.
Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.
For an overview of the documentation available for your product and the latest documentation updates, refer to http://www.suse.com/documentation.
The following manuals are available for this product:
The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt, Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.
Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.
HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Find the latest documentation updates at http://www.suse.com/documentation where you can download the manuals for your product in multiple formats.
Several feedback channels are available:
For services and support options available for your product, refer to http://www.suse.com/support/.
To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select › .
We want to hear your comments and suggestions for this manual and the other documentation included with this product. Use the User Comments feature at the bottom of each page in the online documentation or go to http://www.suse.com/documentation/feedback.html and enter your comments there.
For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).
The following typographical conventions are used in this manual:
/etc/passwd: directory names and file names
placeholder: replace placeholder with the actual value
PATH: the environment variable PATH
ls, --help: commands, options, and parameters
user: users or groups
Alt, Alt–F1: a key to press or a key combination; keys are shown in uppercase as on a keyboard
File, File › Save As: menu items, buttons
Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.
This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.
The Ceph project and its documentation are the result of the work of hundreds of contributors and organizations. See https://ceph.com/contributors/ for more details.
SUSE Enterprise Storage 6 is a distributed storage system designed for scalability, reliability and performance which is based on the Ceph technology. A Ceph cluster can be run on commodity servers in a common network like Ethernet. The cluster scales up well to thousands of servers (later on referred to as nodes) and into the petabyte range. As opposed to conventional systems which have allocation tables to store and fetch data, Ceph uses a deterministic algorithm to allocate storage for data and has no centralized information structure. Ceph assumes that in storage clusters the addition or removal of hardware is the rule, not the exception. The Ceph cluster automates management tasks such as data distribution and redistribution, data replication, failure detection and recovery. Ceph is both self-healing and self-managing which results in a reduction of administrative and budget overhead.
This chapter provides a high level overview of SUSE Enterprise Storage 6 and briefly describes the most important components.
Since SUSE Enterprise Storage 5, the only cluster deployment method is DeepSea. Refer to Chapter 4, Deploying with DeepSea/Salt for details about the deployment process.
The Ceph environment has the following features:
Ceph can scale to thousands of nodes and manage storage in the range of petabytes.
No special hardware is required to run a Ceph cluster. For details, see Chapter 2, Hardware Requirements and Recommendations
The Ceph cluster is self-managing. When nodes are added or removed, or when they fail, the cluster automatically redistributes the data. It is also aware of overloaded disks.
No node in a cluster stores important information alone. The number of redundancies can be configured.
Ceph is an open source software solution and independent of specific hardware or vendors.
To make full use of Ceph's power, it is necessary to understand some of the basic components and concepts. This section introduces some parts of Ceph that are often referenced in other chapters.
The basic component of Ceph is called RADOS (Reliable Autonomic Distributed Object Store). It is responsible for managing the data stored in the cluster. Data in Ceph is usually stored as objects. Each object consists of an identifier and the data.
RADOS provides the following access methods to the stored objects that cover many use cases:
Object Gateway is an HTTP REST gateway for the RADOS object store. It enables direct access to objects stored in the Ceph cluster.
RADOS Block Devices (RBD) can be accessed like any other block device.
These can be used, for example, in combination with libvirt for virtualization purposes.
The Ceph File System is a POSIX-compliant file system.
librados
librados is a library that can be used with many programming languages to create an application capable of directly interacting with the storage cluster.
librados is used by Object Gateway and RBD, while CephFS directly interfaces with RADOS; see Figure 1.1, “Interfaces to the Ceph Object Store”.
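As a brief illustration of direct RADOS access, the rados command line utility can store, list, and retrieve objects in a pool without any gateway in between. The pool and object names below are hypothetical examples:
cephadm >
rados -p mypool put test-object ./local-file.txt
cephadm >
rados -p mypool ls
cephadm >
rados -p mypool get test-object ./retrieved-copy.txt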
At the core of a Ceph cluster is the CRUSH algorithm. CRUSH is the acronym for Controlled Replication Under Scalable Hashing. CRUSH is a function that handles the storage allocation and needs comparably few parameters. That means only a small amount of information is necessary to calculate the storage position of an object. The parameters are a current map of the cluster including the health state, some administrator-defined placement rules and the name of the object that needs to be stored or retrieved. With this information, all nodes in the Ceph cluster are able to calculate where an object and its replicas are stored. This makes writing or reading data very efficient. CRUSH tries to evenly distribute data over all nodes in the cluster.
The CRUSH map contains all storage nodes and administrator-defined placement rules for storing objects in the cluster. It defines a hierarchical structure that usually corresponds to the physical structure of the cluster. For example, the data-containing disks are in hosts, hosts are in racks, racks in rows and rows in data centers. This structure can be used to define failure domains. Ceph then ensures that replications are stored on different branches of a specific failure domain.
If the failure domain is set to rack, replications of objects are distributed over different racks. This can mitigate outages caused by a failed switch in a rack. If one power distribution unit supplies a row of racks, the failure domain can be set to row. When the power distribution unit fails, the replicated data is still available on other rows.
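As a sketch of how a failure domain is applied in practice, the following commands create a replication rule whose failure domain is the rack, and assign it to a pool. The rule name rack-rule and the pool name mypool are hypothetical examples:
cephadm >
ceph osd crush rule create-replicated rack-rule default rack
cephadm >
ceph osd pool set mypool crush_rule rack-rule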
In Ceph, nodes are servers working for the cluster. They can run several different types of daemons. It is recommended to run only one type of daemon on each node, except for Ceph Manager daemons which can be collocated with Ceph Monitors. Each cluster requires at least Ceph Monitor, Ceph Manager and Ceph OSD daemons:
Admin Node is a Ceph cluster node where the Salt master service is running. The Admin Node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services. It usually includes other services as well, for example the Ceph Dashboard Web UI with the Grafana dashboard backed by the Prometheus monitoring toolkit.
Ceph Monitor (often abbreviated as MON) nodes maintain information about the cluster health state, a map of all nodes and data distribution rules (see Section 1.2.2, “CRUSH”).
If failures or conflicts occur, the Ceph Monitor nodes in the cluster decide by majority which information is correct. To form a qualified majority, it is recommended to have an odd number of Ceph Monitor nodes, and at least three of them.
If more than one site is used, the Ceph Monitor nodes should be distributed over an odd number of sites. The number of Ceph Monitor nodes per site should be such that more than 50% of the Ceph Monitor nodes remain functional if one site fails.
The Ceph manager (MGR) collects the state information from the whole cluster. The Ceph manager daemon runs alongside the monitor daemons. It provides additional monitoring, and interfaces with external monitoring and management systems.
The Ceph manager requires no additional configuration, beyond ensuring it is running. You can deploy it as a separate role using DeepSea.
A Ceph OSD is a daemon handling an Object Storage Device, which is a physical or logical storage unit (a hard disk, a partition, or a logical volume). The daemon additionally takes care of data replication and rebalancing in case of added or removed nodes.
Ceph OSD daemons communicate with monitor daemons and provide them with the state of the other OSD daemons.
To use CephFS, Object Gateway, NFS Ganesha, or iSCSI Gateway, additional nodes are required:
The Metadata Servers store metadata for the CephFS. By using an MDS you can execute basic file system commands such as ls without overloading the cluster.
The Object Gateway is an HTTP REST gateway for the RADOS object store. It is compatible with OpenStack Swift and Amazon S3 and has its own user management.
NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. It runs in user space instead of kernel space and directly interacts with the Object Gateway or CephFS.
iSCSI is a storage network protocol that allows clients to send SCSI commands to SCSI storage devices (targets) on remote servers.
The Samba Gateway provides Samba access to data stored on CephFS.
Objects that are stored in a Ceph cluster are put into pools. Pools represent logical partitions of the cluster to the outside world. For each pool a set of rules can be defined, for example, how many replications of each object must exist. The standard configuration of pools is called replicated pool.
Pools usually contain objects but can also be configured to act similarly to RAID 5. In this configuration, objects are stored in chunks along with additional coding chunks. The coding chunks contain the redundant information. The number of data and coding chunks can be defined by the administrator. In this configuration, pools are referred to as erasure coded pools.
Placement Groups (PGs) are used for the distribution of data within a pool. When creating a pool, a certain number of placement groups is set. The placement groups are used internally to group objects and are an important factor for the performance of a Ceph cluster. The PG for an object is determined by the object's name.
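For illustration, the following commands create a replicated pool and an erasure coded pool using the default erasure code profile. The pool names and the placement group count of 64 are hypothetical examples; choose values appropriate for your cluster:
cephadm >
ceph osd pool create replicated-pool 64 64 replicated
cephadm >
ceph osd pool create ec-pool 64 64 erasure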
This section provides a simplified example of how Ceph manages data (see Figure 1.2, “Small Scale Ceph Example”). This example does not represent a recommended configuration for a Ceph cluster. The hardware setup consists of three storage nodes or Ceph OSDs (Host 1, Host 2, Host 3). Each node has three hard disks which are used as OSDs (osd.1 to osd.9). The Ceph Monitor nodes are neglected in this example.
While Ceph OSD or Ceph OSD daemon refers to a daemon that is run on a node, the word OSD refers to the logical disk that the daemon interacts with.
The cluster has two pools, Pool A and Pool B. While Pool A replicates objects only two times, resilience for Pool B is more important and it has three replications for each object.
When an application puts an object into a pool, for example via the REST API, a Placement Group (PG1 to PG4) is selected based on the pool and the object name. The CRUSH algorithm then calculates on which OSDs the object is stored, based on the Placement Group that contains the object.
In this example the failure domain is set to host. This ensures that replications of objects are stored on different hosts. Depending on the replication level set for a pool, the object is stored on two or three OSDs that are used by the Placement Group.
An application that writes an object only interacts with one Ceph OSD, the primary Ceph OSD. The primary Ceph OSD takes care of replication and confirms the completion of the write process after all other OSDs have stored the object.
If osd.5 fails, all objects in PG1 are still available on osd.1. As soon as the cluster recognizes that an OSD has failed, another OSD takes over. In this example osd.4 is used as a replacement for osd.5. The objects stored on osd.1 are then replicated to osd.4 to restore the replication level.
If a new node with new OSDs is added to the cluster, the cluster map is going to change. The CRUSH function then returns different locations for objects. Objects that receive new locations will be relocated. This process results in a balanced usage of all OSDs.
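A simple way to observe this redistribution after adding or removing OSDs is to watch the cluster status and the per-OSD utilization, for example:
cephadm >
ceph -s
cephadm >
ceph osd df tree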
BlueStore is the default storage back end for Ceph since SUSE Enterprise Storage 5. It has better performance than FileStore, full data check-summing, and built-in compression.
BlueStore manages either one, two, or three storage devices. In the simplest case, BlueStore consumes a single primary storage device. The storage device is normally partitioned into two parts:
A small partition named BlueFS that implements file system-like functionalities required by RocksDB.
A large partition occupying the rest of the device. It is managed directly by BlueStore and contains all of the actual data. This primary device is normally identified by a block symbolic link in the data directory.
It is also possible to deploy BlueStore across two additional devices:
A WAL device can be used for BlueStore’s internal
journal or write-ahead log. It is identified by the
block.wal
symbolic link in the data directory. It is only
useful to use a separate WAL device if the device is faster than the primary
device or the DB device, for example when:
The WAL device is an NVMe, and the DB device is an SSD, and the data device is either SSD or HDD.
Both the WAL and DB devices are separate SSDs, and the data device is an SSD or HDD.
A DB device can be used for storing BlueStore’s internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. Again, it is only helpful to provision a shared DB device if it is faster than the primary device.
Plan thoroughly to ensure a sufficient size of the DB device. If the DB device fills up, metadata spills over to the primary device, which badly degrades the OSD's performance.
You can check if a WAL/DB partition is getting full and spilling over with the ceph daemon osd.ID perf dump command. The slow_used_bytes value shows the amount of data being spilled out:
cephadm >
ceph daemon osd.ID perf dump | jq '.bluefs'
"db_total_bytes": 1073741824,
"db_used_bytes": 33554432,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 554432,
"slow_used_bytes": 554432,
Ceph as a community project has its own extensive online documentation. For topics not found in this manual, refer to http://docs.ceph.com/docs/master/.
The original publication CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data by S.A. Weil, S.A. Brandt, E.L. Miller, C. Maltzahn provides helpful insight into the inner workings of Ceph. Especially when deploying large scale clusters it is a recommended reading. The publication can be found at http://www.ssrc.ucsc.edu/papers/weil-sc06.pdf.
The hardware requirements of Ceph are heavily dependent on the IO workload. The following hardware requirements and recommendations should be considered as a starting point for detailed planning.
In general, the recommendations given in this section are on a per-process basis. If several processes are located on the same machine, the CPU, RAM, disk and network requirements need to be added up.
SUSE Enterprise Storage supports both x86 and Arm architectures. When considering each architecture, it is important to note that from a cores per OSD, frequency, and RAM perspective, there is no real difference between CPU architectures for sizing.
Like with smaller x86 processors (non-server), lower-performance Arm-based cores may not provide an optimal experience, especially when used for erasure coded pools.
At least four OSD nodes, with eight OSD disks each, are required.
Three Ceph Monitor nodes (an SSD is required for the dedicated OS drive).
iSCSI Gateways, Object Gateways and Metadata Servers require an additional 4 GB of RAM and four cores.
Ceph Monitors, Object Gateways and Metadata Servers nodes require redundant deployment.
Separate Admin Node with 4 GB RAM, four cores, 1 TB capacity. This is typically the Salt master node. Ceph services and gateways, such as Ceph Monitor, Ceph Manager, Metadata Server, Ceph OSD, Object Gateway, or NFS Ganesha are not supported on the Admin Node.
CPU recommendations:
1x 2GHz CPU Thread per spinner
2x 2GHz CPU Thread per SSD
4x 2GHz CPU Thread per NVMe
Separate 10 GbE networks (public/client and back-end); 4x 10 GbE required, 2x 25 GbE recommended.
Total RAM required = number of OSDs x (1 GB + osd_memory_target) + 16 GB
Refer to Book “Administration Guide”, Chapter 14 “Ceph Cluster Configuration”, Section 14.2.1 “Automatic Cache Sizing” for more details on osd_memory_target.
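For example, assuming a node with 12 OSDs and the default osd_memory_target of 4 GB, the total RAM required is approximately 12 x (1 GB + 4 GB) + 16 GB = 76 GB.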
OSD disks in JBOD configurations or individual RAID-0 configurations.
OSD journal can reside on OSD disk.
OSD disks should be exclusively used by SUSE Enterprise Storage.
Dedicated disk/SSD for the operating system, preferably in a RAID 1 configuration.
If this OSD host will host part of a cache pool used for cache tiering, allocate at least an additional 4 GB of RAM.
Ceph Monitors, gateway and Metadata Servers can reside on Object Storage Nodes.
OSD nodes should be bare metal, not virtualized, for disk performance reasons.
There are two types of disk space needed to run on OSD: the space for the disk journal (for FileStore) or WAL/DB device (for BlueStore), and the primary space for the stored data. The minimum (and default) value for the journal/WAL/DB is 6 GB. The minimum space for data is 5 GB, as partitions smaller than 5 GB are automatically assigned the weight of 0.
So although the minimum disk space for an OSD is 11 GB, we do not recommend a disk smaller than 20 GB, even for testing purposes.
We recommend reserving 4 GB for the WAL device. The recommended size for the DB device is 64 GB for most workloads.
If you intend to put the WAL and DB device on the same disk, then we recommend using a single partition for both devices, rather than having a separate partition for each. This allows Ceph to use the DB device for the WAL operation as well. Management of the disk space is therefore more effective as Ceph uses the DB partition for the WAL only if there is a need for it. Another advantage is that the probability that the WAL partition gets full is very small, and when it is not entirely used then its space is not wasted but used for DB operation.
To share the DB device with the WAL, do not specify the WAL device, and specify only the DB device.
Find more information about specifying an OSD layout in Section 4.5.2, “DriveGroups”.
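As a minimal sketch of such a layout, assuming the drive_groups.yml file introduced in Section 4.5.2, “DriveGroups”, a specification that places data on rotational disks and the DB (and therefore also the WAL) on solid state disks could look similar to the following; adjust the target and device filters to your hardware:
# Hypothetical example; device selectors depend on your setup.
drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  # No wal_devices entry: the WAL shares the DB device.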
You can have as many disks in one server as the server allows. There are a few things to consider when planning the number of disks per server:
Network bandwidth. The more disks you have in a server, the more data must be transferred via the network card(s) for the disk write operations.
Memory. RAM above 2 GB is used for the BlueStore cache. With the default osd_memory_target of 4 GB, the system has a reasonable starting cache size for spinning media. If using SSD or NVMe, consider increasing the cache size and RAM allocation per OSD to maximize performance (see the example after this list).
Fault tolerance. If the complete server fails, the more disks it has, the more OSDs the cluster temporarily loses. Moreover, to keep the replication rules running, you need to copy all the data from the failed server among the other nodes in the cluster.
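For example, to raise the per-OSD cache allowance on all-flash OSD nodes, you could increase osd_memory_target cluster-wide. The value below (8 GB expressed in bytes) is only an illustrative assumption:
cephadm >
ceph config set osd osd_memory_target 8589934592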
At least three Ceph Monitor nodes are required. The number of monitors should always be odd (1+2n).
4 GB of RAM.
Processor with four logical cores.
An SSD or other sufficiently fast storage type is highly recommended for monitors, specifically for the /var/lib/ceph path on each monitor node, as quorum may be unstable with high disk latencies. Two disks in RAID 1 configuration are recommended for redundancy. It is recommended that separate disks or at least separate disk partitions are used for the monitor processes to protect the monitor's available disk space from things like log file creep.
There must only be one monitor process per node.
Mixing OSD, monitor, or Object Gateway nodes is only supported if sufficient hardware resources are available. That means that the requirements for all services need to be added up.
Two network interfaces bonded to multiple switches.
Object Gateway nodes should have six to eight CPU cores and 32 GB of RAM (64 GB recommended). When other processes are co-located on the same machine, their requirements need to be added up.
Proper sizing of the Metadata Server nodes depends on the specific use case. Generally, the more open files the Metadata Server is to handle, the more CPU and RAM it needs. Following are the minimal requirements:
3 GB of RAM per Metadata Server daemon.
Bonded network interface.
2.5 GHz CPU with at least 2 cores.
At least 4 GB of RAM and a quad-core CPU are required. This includes running the Ceph Dashboard on the Admin Node. For large clusters with hundreds of nodes, 6 GB of RAM is suggested.
iSCSI nodes should have six to eight CPU cores and 16 GB of RAM.
The network environment where you intend to run Ceph should ideally be a bonded set of at least two network interfaces that is logically split into a public part and a trusted internal part using VLANs. The bonding mode is recommended to be 802.3ad if possible to provide maximum bandwidth and resiliency.
The public VLAN serves to provide the service to the customers, while the internal part provides for the authenticated Ceph network communication. The main reason for this is that although Ceph provides authentication and protection against attacks once secret keys are in place, the messages used to configure these keys may be transferred openly and are vulnerable.
If your storage nodes are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (running systemctl status ceph\* will result in "unable to bind" errors). To avoid this issue, we recommend increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node:
In /etc/sysconfig/network/dhcp, set:
DHCLIENT_WAIT_AT_BOOT="30"
In /etc/sysconfig/network/config, set:
WAIT_FOR_INTERFACES="60"
If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improves when you set a second private cluster network. To support two networks, each Ceph node needs to have at least two network cards.
You need to apply the following changes to each Ceph node. It is relatively quick to do for a small cluster, but can be very time consuming if you have a cluster consisting of hundreds or thousands of nodes.
Stop Ceph related services on each cluster node.
Add a line to /etc/ceph/ceph.conf to define the cluster network, for example:
cluster network = 10.0.0.0/24
If you need to specifically assign static IP addresses or override cluster network settings, you can do so with the optional cluster addr.
Check that the private cluster network works as expected on the OS level.
Start Ceph related services on each cluster node.
root #
systemctl start ceph.target
If the monitor nodes are on multiple subnets, for example they are located
in different rooms and served by different switches, you need to adjust the
ceph.conf
file accordingly. For example if the nodes
have IP addresses 192.168.123.12, 1.2.3.4, and 242.12.33.12, add the
following lines to its global section:
[global]
[...]
mon host = 192.168.123.12, 1.2.3.4, 242.12.33.12
mon initial members = MON1, MON2, MON3
[...]
Additionally, if you need to specify a per-monitor public address or network, you need to add a [mon.X] section for each monitor:
[mon.MON1]
public network = 192.168.123.0/24
[mon.MON2]
public network = 1.2.3.0/24
[mon.MON3]
public network = 242.12.33.12/0
Ceph does not generally support non-ASCII characters in configuration files, pool names, user names and so forth. When configuring a Ceph cluster we recommend using only simple alphanumeric characters (A-Z, a-z, 0-9) and minimal punctuation ('.', '-', '_') in all Ceph object/configuration names.
Seven Object Storage Nodes
No single node exceeds ~15% of total storage
10 Gb Ethernet (four physical networks bonded to multiple switches)
56+ OSDs per storage cluster
RAID 1 OS disks for each OSD storage node
SSDs for Journal with 6:1 ratio SSD journal to OSD
1.5 GB of RAM per TB of raw OSD capacity for each Object Storage Node
2 GHz per OSD for each Object Storage Node
Dedicated physical infrastructure nodes
Three Ceph Monitor nodes: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk
One SES management node: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk
Redundant physical deployment of gateway or Metadata Server nodes:
Object Gateway nodes: 32 GB RAM, 8 core processor, RAID 1 SSDs for disk
iSCSI Gateway nodes: 16 GB RAM, 4 core processor, RAID 1 SSDs for disk
Metadata Server nodes (one active/one hot standby): 32 GB RAM, 8 core processor, RAID 1 SSDs for disk
This section contains important information about integrating SUSE Enterprise Storage 6 with other SUSE products.
SUSE Manager and SUSE Enterprise Storage are not integrated, therefore SUSE Manager cannot currently manage a SUSE Enterprise Storage cluster.
Admin Node is a Ceph cluster node where the Salt master service is running. The admin node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services. It usually includes other services as well, for example the Ceph Dashboard Web UI with the Grafana dashboard backed by the Prometheus monitoring toolkit.
In case of an Admin Node failure, you usually need to provide new working hardware for the node and restore the complete cluster configuration stack from a recent backup. Such a method is time-consuming and causes cluster outage.
To prevent Ceph cluster performance downtime caused by an Admin Node failure, we recommend making use of a High Availability (HA) cluster for the Ceph Admin Node.
The idea of an HA cluster is that in case of one cluster node failure, the other node automatically takes over its role including the virtualized Admin Node. This way other Ceph cluster nodes do not notice that the Admin Node failed.
The minimal HA solution for the Admin Node requires the following hardware:
Two bare metal servers able to run SUSE Linux Enterprise with the High Availability extension and virtualize the Admin Node.
Two or more redundant network communication paths, for example via Network Device Bonding.
Shared storage to host the disk image(s) of the Admin Node virtual machine. The shared storage needs to be accessible from both servers. It can be, for example, an NFS export, a Samba share, or an iSCSI target.
Find more details on the cluster requirements in https://www.suse.com/documentation/sle-ha-15/book_sleha_quickstarts/data/sec_ha_inst_quick_req.html.
The following procedure summarizes the most important steps of building the HA cluster for virtualizing the Admin Node. For details, refer to the indicated links.
Set up a basic 2-node HA cluster with shared storage as described in https://www.suse.com/documentation/sle-ha-15/book_sleha_quickstarts/data/art_sleha_install_quick.html.
On both cluster nodes, install all packages required for running the KVM
hypervisor and the libvirt
toolkit as described in
https://www.suse.com/documentation/sles-15/book_virt/data/sec_vt_installation_kvm.html.
On the first cluster node, create a new KVM virtual machine (VM) making
use of libvirt
as described in
https://www.suse.com/documentation/sles-15/book_virt/data/sec_libvirt_inst_vmm.html.
Use the preconfigured shared storage to store the disk images of the VM.
After the VM setup is complete, export its configuration to an XML file on the shared storage. Use the following syntax:
root #
virsh dumpxml VM_NAME > /path/to/shared/vm_name.xml
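Assuming the exported XML file is visible on the second node via the shared storage, you can verify the definition manually there; in normal operation, the HA resource created in the next step starts and stops the VM. The path follows the hypothetical example above:
root #
virsh define /path/to/shared/vm_name.xml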
Create a resource for the Admin Node VM. Refer to https://www.suse.com/documentation/sle-ha-15/book_sleha_guide/data/cha_conf_hawk2.html for general info on creating HA resources. Detailed info on creating resource for a KVM virtual machine is described in http://www.linux-ha.org/wiki/VirtualDomain_%28resource_agent%29.
On the newly created VM guest, deploy the Admin Node including the additional services you need there. Follow relevant steps in Section 4.3, “Cluster Deployment”. At the same time, deploy the remaining Ceph cluster nodes on the non-HA cluster servers.
Salt along with DeepSea is a stack of components that help you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:
Salt minions are the nodes controlled by a dedicated node called Salt master. Salt minions have roles, for example Ceph OSD, Ceph Monitor, Ceph Manager, Object Gateway, iSCSI Gateway, or NFS Ganesha.
A Salt master runs its own Salt minion. It is required for running privileged tasks—for example creating, authorizing, and copying keys to minions—so that remote minions never need to run privileged tasks.
You will get the best performance from your Ceph cluster when each role is deployed on a separate node. But real deployments sometimes require sharing one node for multiple roles. To avoid troubles with performance and upgrade procedure, do not deploy the Ceph OSD, Metadata Server, or Ceph Monitor role to the Admin Node.
Salt minions need to correctly resolve the Salt master's host name over the network. By default, they look for the salt host name, but you can specify any other network-reachable host name in the /etc/salt/minion file, see Section 4.3, “Cluster Deployment”.
In the release notes you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:
your hardware needs special considerations.
any used software packages have changed significantly.
special precautions are necessary for your installation.
The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.
After having installed the package release-notes-ses,
find the release notes locally in the directory
/usr/share/doc/release-notes
or online at
https://www.suse.com/releasenotes/.
The goal of DeepSea is to save the administrator time and confidently perform complex operations on a Ceph cluster.
Ceph is a very configurable software solution. It increases both the freedom and responsibility of system administrators.
The minimal Ceph setup is good for demonstration purposes, but does not show interesting features of Ceph that you can see with a big number of nodes.
DeepSea collects and stores data about individual servers, such as addresses and device names. For a distributed storage system such as Ceph, there can be hundreds of such items to collect and store. Collecting the information and entering the data manually into a configuration management tool is exhausting and error prone.
The steps necessary to prepare the servers, collect the configuration, and configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day to day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement.
DeepSea addresses these observations with the following strategy: DeepSea consolidates the administrator's decisions in a single file. The decisions include cluster assignment, role assignment and profile assignment. And DeepSea collects each set of tasks into a simple goal. Each goal is a stage:
Stage 0—the preparation—during this stage, all required updates are applied and your system may be rebooted.
If the Admin Node reboots during stage.0 to load the new kernel version, you need to run stage.0 again, otherwise minions will not be targeted.
Stage 1—the discovery—here all hardware in your cluster is being detected and necessary information for the Ceph configuration is being collected. For details about configuration, refer to Section 4.5, “Configuration and Customization”.
Stage 2—the configuration—you need to prepare configuration data in a particular format.
Stage 3—the deployment—creates a basic Ceph cluster with mandatory Ceph services. See Section 1.2.3, “Ceph Nodes and Daemons” for their list.
Stage 4—the services—additional features of Ceph like iSCSI, Object Gateway and CephFS can be installed in this stage. Each is optional.
Stage 5—the removal stage. This stage is not mandatory and during the initial setup it is usually not needed. In this stage the roles of minions and also the cluster configuration are removed. You need to run this stage when you need to remove a storage node from your cluster. For details refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.3 “Removing and Reinstalling Cluster Nodes”.
Salt has several standard locations and several naming conventions used on your master node:
/srv/pillar
The directory stores configuration data for your cluster minions. Pillar is an interface for providing global configuration values to all your cluster minions.
/srv/salt/
The directory stores Salt state files (also called sls files). State files are formatted descriptions of states in which the cluster should be.
/srv/module/runners
The directory stores Python scripts known as runners. Runners are executed on the master node.
/srv/salt/_modules
The directory stores Python scripts that are called modules. The modules are applied to all minions in your cluster.
/srv/pillar/ceph
The directory is used by DeepSea. Collected configuration data are stored here.
/srv/salt/ceph
A directory used by DeepSea. It stores sls files that can be in different formats, but each subdirectory contains only one type of sls file. For example, /srv/salt/ceph/stage contains orchestration files that are executed by salt-run state.orchestrate.
DeepSea commands are executed via the Salt infrastructure. When using the salt command, you need to specify a set of Salt minions that the command will affect. We describe the set of minions as a target for the salt command.
The following sections describe possible methods to target the minions.
You can target a minion or a group of minions by matching their names. A minion's name is usually the short host name of the node where the minion runs. This is a general Salt targeting method, not related to DeepSea. You can use globbing, regular expressions, or lists to limit the range of minion names. The general syntax follows:
root@master #
salt target example.module
If all Salt minions in your environment belong to your Ceph cluster, you
can safely substitute target with
'*'
to include all registered
minions.
Match all minions in the example.net domain (assuming the minion names are identical to their "full" host names):
root@master #
salt '*.example.net' test.ping
Match the 'web1' to 'web5' minions:
root@master #
salt 'web[1-5]' test.ping
Match both 'web1-prod' and 'web1-devel' minions using a regular expression:
root@master #
salt -E 'web1-(prod|devel)' test.ping
Match a simple list of minions:
root@master #
salt -L 'web1,web2,web3' test.ping
Match all minions in the cluster:
root@master #
salt '*' test.ping
In a heterogeneous Salt-managed environment where SUSE Enterprise Storage 6 is deployed on a subset of nodes alongside other cluster solutions, you need to mark the relevant minions by applying a 'deepsea' grain to them before running DeepSea stage.0. This way you can easily target DeepSea minions in environments where matching by the minion name is problematic.
To apply the 'deepsea' grain to a group of minions, run:
root@master #
salt target grains.append deepsea default
To remove the 'deepsea' grain from a group of minions, run:
root@master #
salt target grains.delval deepsea destructive=True
After applying the 'deepsea' grain to the relevant minions, you can target them as follows:
root@master #
salt -G 'deepsea:*' test.ping
The following command is an equivalent:
root@master #
salt -C 'G@deepsea:*' test.ping
Set the deepsea_minions Option
Setting the deepsea_minions option's target is a requirement for DeepSea deployments. DeepSea uses it to instruct minions during stages execution (refer to DeepSea Stages Description for details).
To set or change the deepsea_minions option, edit the /srv/pillar/ceph/deepsea_minions.sls file on the Salt master and add or replace the following line:
deepsea_minions: target
deepsea_minions Target
As the target for the deepsea_minions option, you can use any targeting method: both Matching the Minion Name and Targeting with a 'deepsea' Grain.
Match all Salt minions in the cluster:
deepsea_minions: '*'
Match all minions with the 'deepsea' grain:
deepsea_minions: 'G@deepsea:*'
You can use more advanced ways to target minions using the Salt
infrastructure. The 'deepsea-minions' manual page gives you more details
about DeepSea targeting (man 7 deepsea_minions
).
The cluster deployment process has several phases. First, you need to prepare all nodes of the cluster by configuring Salt and then deploy and configure Ceph.
If you need to skip defining storage roles for OSD as described in
Section 4.5.1.2, “Role Assignment” and deploy Ceph Monitor nodes first, you
can do so by setting the DEV_ENV
variable.
It allows deploying monitors without the presence of the
role-storage/
directory, as well as deploying a Ceph
cluster with at least one storage, monitor, and
manager role.
To set the environment variable, either enable it globally by setting it in
the /srv/pillar/ceph/stack/global.yml
file, or set it
for the current shell session only:
root@master #
export DEV_ENV=true
As an example, /srv/pillar/ceph/stack/global.yml
can
be created with the following contents:
DEV_ENV: True
The following procedure describes the cluster preparation in detail.
Install and register SUSE Linux Enterprise Server 15 SP1 together with the SUSE Enterprise Storage 6 extension on each node of the cluster.
Verify that proper products are installed and registered by listing
existing software repositories. Run zypper lr -E
and
compare the output with the following list:
SLE-Product-SLES15-SP1-Pool
SLE-Product-SLES15-SP1-Updates
SLE-Module-Server-Applications15-SP1-Pool
SLE-Module-Server-Applications15-SP1-Updates
SLE-Module-Basesystem15-SP1-Pool
SLE-Module-Basesystem15-SP1-Updates
SUSE-Enterprise-Storage-6-Pool
SUSE-Enterprise-Storage-6-Updates
Configure network settings including proper DNS name resolution on each node. The Salt master and all the Salt minions need to resolve each other by their host names. For more information on configuring a network, see https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_network_yast.html. For more information on configuring a DNS server, see https://www.suse.com/documentation/sles-15/book_sle_admin/data/cha_dns.html.
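A quick way to verify the resolution is to query each host name from every node. The host name below is a hypothetical example:
root #
getent hosts ses-node1.example.com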
Select one or more time servers/pools, and synchronize the local time against them. Verify that the time synchronization service is enabled on each system start-up. You can use the yast ntp-client command found in the yast2-ntp-client package to configure time synchronization.
Virtual machines are not reliable NTP sources.
Find more information on setting up NTP in https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_ntp_yast.html.
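On SUSE Linux Enterprise Server 15 SP1, the time synchronization service is chronyd. A minimal check that it is enabled and synchronizing could look like this:
root #
systemctl enable --now chronyd.service
root #
chronyc sources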
Install the salt-master
and
salt-minion
packages on the Salt master node:
root@master #
zypper in salt-master salt-minion
Check that the salt-master
service is enabled and
started, and enable and start it if needed:
root@master #
systemctl enable salt-master.service
root@master #
systemctl start salt-master.service
If you intend to use a firewall, verify that the Salt master node has ports 4505 and 4506 open to all Salt minion nodes. If the ports are closed, you can open them using the yast2 firewall command by allowing the respective service.
DeepSea deployment stages fail when the firewall is active (and even when it is only configured). To pass the stages correctly, you need to either turn the firewall off by running
root #
systemctl stop firewalld.service
or set the FAIL_ON_WARNING
option to 'False' in
/srv/pillar/ceph/stack/global.yml
:
FAIL_ON_WARNING: False
Install the package salt-minion
on all minion nodes.
root #
zypper in salt-minion
Make sure that the fully qualified domain name of each node can be resolved to the public network IP address by all other nodes.
Configure all minions (including the master minion) to connect to the
master. If your Salt master is not reachable by the host name
salt
, edit the file
/etc/salt/minion
or create a new file
/etc/salt/minion.d/master.conf
with the following
content:
master: host_name_of_salt_master
If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:
root@minion >
systemctl restart salt-minion.service
Check that the salt-minion
service is enabled and
started on all nodes. Enable and start it if needed:
root #
systemctl enable salt-minion.service
root #
systemctl start salt-minion.service
Verify each Salt minion's fingerprint and accept all salt keys on the Salt master if the fingerprints match.
If the Salt minion fingerprint comes back empty, make sure the Salt minion has a Salt master configuration and it can communicate with the Salt master.
View each minion's fingerprint:
root@minion >
salt-call --local key.finger
local:
3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...
After gathering fingerprints of all the Salt minions, list fingerprints of all unaccepted minion keys on the Salt master:
root@master #
salt-key -F
[...]
Unaccepted Keys:
minion1:
3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...
If the minions' fingerprints match, accept them:
root@master #
salt-key --accept-all
Verify that the keys have been accepted:
root@master #
salt-key --list-all
Prior to deploying SUSE Enterprise Storage 6, manually zap all the disks. Remember to replace 'X' with the correct disk letter:
Stop all processes that are using the specific disk.
Verify whether any partition on the disk is mounted, and unmount if needed.
If the disk is managed by LVM, deactivate and delete the whole LVM infrastructure. Refer to https://www.suse.com/documentation/sles-15/book_storage/data/cha_lvm.html for more details.
If the disk is part of MD RAID, deactivate the RAID. Refer to https://www.suse.com/documentation/sles-15/book_storage/data/part_software_raid.html for more details.
If you get error messages such as 'partition in use' or 'kernel can not be updated with the new partition table' during the following steps, reboot the server.
Wipe the beginning of each partition (as root):
for partition in /dev/sdX[0-9]*
do
  dd if=/dev/zero of=$partition bs=4096 count=1 oflag=direct
done
Wipe the beginning of the drive:
root #
dd if=/dev/zero of=/dev/sdX bs=512 count=34 oflag=direct
Wipe the end of the drive:
root #
dd if=/dev/zero of=/dev/sdX bs=512 count=33 \
seek=$((`blockdev --getsz /dev/sdX` - 33)) oflag=direct
Verify drive is empty (with no GPT structures) using:
root #
parted -s /dev/sdX print free
or
root #
dd if=/dev/sdX bs=512 count=34 | hexdump -C
root #
dd if=/dev/sdX bs=512 count=33 \
  skip=$((`blockdev --getsz /dev/sdX` - 33)) | hexdump -C
Optionally, if you need to preconfigure the cluster's network settings
before the deepsea package is installed, create
/srv/pillar/ceph/stack/ceph/cluster.yml
manually and
set the cluster_network:
and
public_network:
options. Note that the file will not be
overwritten after you install deepsea.
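A minimal sketch of such a file, assuming hypothetical example subnets, could look as follows:
# /srv/pillar/ceph/stack/ceph/cluster.yml
cluster_network: 10.1.1.0/24
public_network: 192.168.100.0/24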
If you need to enable IPv6 network addressing, refer to Section 6.2.1, “Enabling IPv6 for Ceph Cluster Deployment”.
Install DeepSea on the Salt master node:
root@master #
zypper in deepsea
The value of the master_minion
parameter is dynamically
derived from the /etc/salt/minion_id
file on the
Salt master. If you need to override the discovered value, edit the file
/srv/pillar/ceph/stack/global.yml
and set a relevant
value:
master_minion: MASTER_MINION_NAME
If your Salt master is reachable via more host names, use the Salt minion name
for the storage cluster as returned by the salt-key -L
command. If you used the default host name for your
Salt master—salt—in the
ses domain, then the file looks as follows:
master_minion: salt.ses
Now you deploy and configure Ceph. Unless specified otherwise, all steps are mandatory.
There are two possible ways how to run salt-run
state.orch
—one is with
'stage.STAGE_NUMBER', the other is with the name
of the stage. Both notations have the same impact and it is fully your
preference which command you use.
Ensure the Salt minions belonging to the Ceph cluster are correctly
targeted through the deepsea_minions
option in
/srv/pillar/ceph/deepsea_minions.sls
. Refer to
Section 4.2.2.3, “Set the deepsea_minions
Option” for more information.
By default, DeepSea deploys Ceph clusters with tuned profiles active
on Ceph Monitor, Ceph Manager, and Ceph OSD nodes. In some cases, you may need to deploy
without tuned profiles. To do so, put the following lines in
/srv/pillar/ceph/stack/global.yml
before running
DeepSea stages:
alternative_defaults:
  tuned_mgr_init: default-off
  tuned_mon_init: default-off
  tuned_osd_init: default-off
Optional: create Btrfs sub-volumes for
/var/lib/ceph/
. This step should only be executed
before DeepSea stages. To migrate existing directories or for more
details, see Book “Administration Guide”, Chapter 23 “Hints and Tips”, Section 23.6 “Btrfs Subvolume for /var/lib/ceph
on Ceph Monitor Nodes”.
root@master #
salt 'MON_NODES' state.apply ceph.subvolume
Prepare your cluster. Refer to DeepSea Stages Description for more details.
root@master #
salt-run state.orch ceph.stage.0
or
root@master #
salt-run state.orch ceph.stage.prep
Using the DeepSea CLI, you can follow the stage execution progress in real-time, either by running the DeepSea CLI in the monitoring mode, or by running the stage directly through DeepSea CLI. For details refer to Section 4.4, “DeepSea CLI”.
The discovery stage collects data from all minions and creates
configuration fragments that are stored in the directory
/srv/pillar/ceph/proposals
. The data are stored in
the YAML format in *.sls or *.yml files.
Run the following command to trigger the discovery stage:
root@master #
salt-run state.orch ceph.stage.1
or
root@master #
salt-run state.orch ceph.stage.discovery
After the previous command finishes successfully, create a
policy.cfg
file in
/srv/pillar/ceph/proposals
. For details refer to
Section 4.5.1, “The policy.cfg
File”.
If you need to change the cluster's network setting, edit
/srv/pillar/ceph/stack/ceph/cluster.yml
and adjust
the lines starting with cluster_network:
and
public_network:
.
The configuration stage parses the policy.cfg
file
and merges the included files into their final form. Cluster and role
related content are placed in
/srv/pillar/ceph/cluster
, while Ceph specific
content is placed in /srv/pillar/ceph/stack/default
.
Run the following command to trigger the configuration stage:
root@master #
salt-run state.orch ceph.stage.2
or
root@master #
salt-run state.orch ceph.stage.configure
The configuration step may take several seconds. After the command
finishes, you can view the pillar data for the specified minions (for
example, named ceph_minion1
,
ceph_minion2
, etc.) by running:
root@master #
salt 'ceph_minion*' pillar.items
If you want to modify the default OSD's layout and change the drive groups configuration, follow the procedure described in Section 4.5.2, “DriveGroups”.
As soon as the command finishes, you can view the default configuration and change it to suit your needs. For details refer to Chapter 6, Customizing the Default Configuration.
Now you run the deployment stage. In this stage, the pillar is validated, and the Ceph Monitor and Ceph OSD daemons are started:
root@master #
salt-run state.orch ceph.stage.3
or
root@master #
salt-run state.orch ceph.stage.deploy
The command may take several minutes. If it fails, you need to fix the issue and run the previous stages again. After the command succeeds, run the following to check the status:
cephadm >
ceph -s
The last step of the Ceph cluster deployment is the services stage. Here you instantiate any of the currently supported services: iSCSI Gateway, CephFS, Object Gateway, and NFS Ganesha. In this stage, the necessary pools and authorization keyrings are created, and the services are started. To start the stage, run the following:
root@master #
salt-run state.orch ceph.stage.4
or
root@master #
salt-run state.orch ceph.stage.services
Depending on the setup, the command may run for several minutes.
Before you continue, we strongly recommend enabling the Ceph telemetry module. For more information and instructions, see Book “Administration Guide”, Chapter 8 “Ceph Manager Modules”, Section 8.2 “Telemetry Module”.
DeepSea also provides a command line interface (CLI) tool that allows the
user to monitor or run stages while visualizing the execution progress in
real-time. Verify that the deepsea-cli package is
installed before you run the deepsea
executable.
Two modes are supported for visualizing a stage's execution progress:
Monitoring mode: visualizes the execution
progress of a DeepSea stage triggered by the salt-run
command issued in another terminal session.
Stand-alone mode: runs a DeepSea stage while providing real-time visualization of its component steps as they are executed.
The DeepSea CLI commands can only be run on the Salt master node with the
root
privileges.
The progress monitor provides a detailed, real-time visualization of what
is happening during execution of stages using salt-run
state.orch
commands in other terminal sessions.
You need to start the monitor in a new terminal window
before running any salt-run
state.orch
so that the monitor can detect the start of the
stage's execution.
If you start the monitor after issuing the salt-run
state.orch
command, then no execution progress will be shown.
You can start the monitor mode by running the following command:
root@master #
deepsea monitor
For more information about the available command line options of the
deepsea monitor
command check its manual page:
cephadm >
man deepsea-monitor
In the stand-alone mode, DeepSea CLI can be used to run a DeepSea stage, showing its execution in real-time.
The command to run a DeepSea stage from the DeepSea CLI has the following form:
root@master #
deepsea stage run stage-name
where stage-name corresponds to the way Salt
orchestration state files are referenced. For example, stage
deploy, which corresponds to the directory
located in /srv/salt/ceph/stage/deploy
, is referenced
as ceph.stage.deploy.
This command is an alternative to the Salt-based commands for running DeepSea stages (or any DeepSea orchestration state file).
The command deepsea stage run ceph.stage.0
is equivalent
to salt-run state.orch ceph.stage.0
.
For more information about the available command line options accepted by
the deepsea stage run
command check its manual page:
root@master #
man deepsea-stage run
The following figure shows an example of the output of the DeepSea CLI when running Stage 2.
stage run Alias
For advanced users of Salt, we also support an alias for running a
DeepSea stage that takes the Salt command used to run a stage, for
example, salt-run state.orch
stage-name
, as a command of the
DeepSea CLI.
Example:
root@master #
deepsea salt-run state.orch stage-name
The policy.cfg File
The /srv/pillar/ceph/proposals/policy.cfg
configuration file is used to determine roles of individual cluster nodes.
For example, which nodes act as Ceph OSDs or Ceph Monitors. Edit
policy.cfg
in order to reflect your desired cluster
setup. The order of the sections is arbitrary, but the content of included
lines overwrites matching keys from the content of previous lines.
policy.cfg
You can find several examples of complete policy files in the
/usr/share/doc/packages/deepsea/examples/
directory.
In the cluster section you select minions for your cluster. You can select all minions, or you can blacklist or whitelist minions. Examples for a cluster called ceph follow.
To include all minions, add the following lines:
cluster-ceph/cluster/*.sls
To whitelist a particular minion:
cluster-ceph/cluster/abc.domain.sls
or a group of minions, using shell glob matching:
cluster-ceph/cluster/mon*.sls
To blacklist minions, set them to unassigned:
cluster-unassigned/cluster/client*.sls
This section provides you with details on assigning 'roles' to your
cluster nodes. A 'role' in this context means the service you need to run
on the node, such as Ceph Monitor, Object Gateway, or iSCSI Gateway. No role is assigned automatically; only roles added to policy.cfg will be deployed.
The assignment follows this pattern:
role-ROLE_NAME/PATH/FILES_TO_INCLUDE
Where the items have the following meaning and values:
ROLE_NAME is any of the following: 'master', 'admin', 'mon', 'mgr', 'storage', 'mds', 'igw', 'rgw', 'ganesha', 'grafana', or 'prometheus'.
PATH is a relative directory path to .sls or
.yml files. In case of .sls files, it usually is
cluster
, while .yml files are located at
stack/default/ceph/minions
.
FILES_TO_INCLUDE are the Salt state files
or YAML configuration files. They normally consist of Salt minions host
names, for example ses5min2.yml
. Shell globbing can
be used for more specific matching.
An example for each role follows:
master - the node has admin keyrings to all Ceph clusters. Currently, only a single Ceph cluster is supported. As the master role is mandatory, always add a line similar to the following:
role-master/cluster/master*.sls
admin - the minion will have an admin keyring. You define the role as follows:
role-admin/cluster/abc*.sls
mon - the minion will provide the monitor service to the Ceph cluster. This role requires the addresses of the assigned minions. Since SUSE Enterprise Storage 5, the public addresses are calculated dynamically and are no longer needed in the Salt pillar.
role-mon/cluster/mon*.sls
The example assigns the monitor role to a group of minions.
mgr - the Ceph manager daemon which collects all the state information from the whole cluster. Deploy it on all minions where you plan to deploy the Ceph monitor role.
role-mgr/cluster/mgr*.sls
storage - use this role to specify storage nodes.
role-storage/cluster/data*.sls
mds - the minion will provide the metadata service to support CephFS.
role-mds/cluster/mds*.sls
igw - the minion will act as an iSCSI Gateway. This role requires the addresses of the assigned minions; thus, you also need to include the files from the stack directory:
role-igw/cluster/*.sls
rgw - the minion will act as an Object Gateway:
role-rgw/cluster/rgw*.sls
ganesha - the minion will act as an NFS Ganesha server. The 'ganesha' role requires either an 'rgw' or 'mds' role in cluster, otherwise the validation will fail in Stage 3.
role-ganesha/cluster/ganesha*.sls
To successfully install NFS Ganesha, additional configuration is required. If you want to use NFS Ganesha, read Chapter 11, Installation of NFS Ganesha before executing stages 2 and 4. However, it is possible to install NFS Ganesha later.
In some cases it can be useful to define custom roles for NFS Ganesha nodes. For details, see Book “Administration Guide”, Chapter 19 “NFS Ganesha: Export Ceph Data via NFS”, Section 19.3 “Custom NFS Ganesha Roles”.
grafana, prometheus - this node adds Grafana charts based on Prometheus alerting to the Ceph Dashboard. Refer to Book “Administration Guide”, Chapter 20 “Ceph Dashboard” for its detailed description.
role-grafana/cluster/grafana*.sls
role-prometheus/cluster/prometheus*.sls
You can assign several roles to a single node. For example, you can assign the 'mds' roles to the monitor nodes:
role-mds/cluster/mon[1,2]*.sls
The common configuration section includes configuration files generated
during the discovery (Stage 1). These configuration
files store parameters like fsid
or
public_network
. To include the required Ceph common
configuration, add the following lines:
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
Sometimes it is not practical to include all files from a given directory
with *.sls globbing. The policy.cfg
file parser
understands the following filters:
This section describes filtering techniques for advanced users. When not used correctly, filtering can cause problems, for example when your node numbering changes.
Use the slice filter to include only items start
through end-1. Note that items in the given
directory are sorted alphanumerically. The following line includes the
third to fifth files from the role-mon/cluster/
subdirectory:
role-mon/cluster/*.sls slice[3:6]
Use the regular expression filter to include only items matching the given expressions. For example:
role-mon/cluster/mon*.sls re=.*1[135]\.subdomainX\.sls$
Example policy.cfg File
Following is an example of a basic policy.cfg
file:
## Cluster Assignment
cluster-ceph/cluster/*.sls 1

## Roles
# ADMIN
role-master/cluster/examplesesadmin.sls 2
role-admin/cluster/sesclient*.sls 3

# MON
role-mon/cluster/ses-example-[123].sls 4

# MGR
role-mgr/cluster/ses-example-[123].sls 5

# STORAGE
role-storage/cluster/ses-example-[5,6,7,8].sls 6

# MDS
role-mds/cluster/ses-example-4.sls 7

# IGW
role-igw/cluster/ses-example-4.sls 8

# RGW
role-rgw/cluster/ses-example-4.sls 9

# COMMON
config/stack/default/global.yml 10
config/stack/default/ceph/cluster.yml 11
Indicates that all minions are included in the Ceph cluster. If you have minions you do not want to include in the Ceph cluster, use:
cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/ses-example-*.sls
The first line marks all minions as unassigned. The second line overrides minions matching 'ses-example-*.sls', and assigns them to the Ceph cluster.
The minion called 'examplesesadmin' has the 'master' role. This also means that it will get admin keys to the cluster.
All minions matching 'sesclient*' will get admin keys as well.
All minions matching 'ses-example-[123]' (presumably three minions: ses-example-1, ses-example-2, and ses-example-3) will be set up as MON nodes.
All minions matching 'ses-example-[123]' (all MON nodes in the example) will be set up as MGR nodes.
All minions matching 'ses-example-[5,6,7,8]' will be set up as storage nodes.
Minion 'ses-example-4' will have the MDS role.
Minion 'ses-example-4' will have the IGW role.
Minion 'ses-example-4' will have the RGW role.
Means that we accept the default values for common configuration parameters such as fsid and public_network.
Means that we accept the default values for common configuration parameters such as fsid and public_network.
DriveGroups specify the layouts of OSDs in the Ceph cluster. They are defined in a single file, /srv/salt/ceph/configuration/files/drive_groups.yml.
An administrator should manually specify a group of OSDs that are interrelated (hybrid OSDs that are deployed on solid state and spinning media) or that share the same deployment options (for example the same object store, the same encryption option, or stand-alone OSDs). To avoid explicitly listing devices, DriveGroups use a list of filter items that correspond to a few selected fields of ceph-volume's inventory reports. In the simplest case this could be the 'rotational' flag (all solid-state drives are to be used as db_devices, all rotating ones as data_devices), or something more involved such as 'model' strings or sizes. DeepSea provides code that translates these DriveGroups into actual device lists for inspection by the user.
Following is a simple procedure that demonstrates the basic workflow when configuring DriveGroups:
Inspect your disks' properties as seen by the
ceph-volume
command. Only these properties are
accepted by DriveGroups:
root@master #
salt-run disks.details
Open the /srv/salt/ceph/configuration/files/drive_groups.yml YAML file and adjust it to your needs. Refer to Section 4.5.2.1, “Specification”. Remember to use spaces instead of tabs. Find more advanced examples in Section 4.5.2.4, “Examples”. The following example includes all drives available to Ceph as OSDs:
default_drive_group_name:
  target: '*'
  data_devices:
    all: true
Verify new layouts:
root@master #
salt-run disks.list
This runner returns you a structure of matching disks based on your drive groups. If you are not happy with the result, repeat the previous step.
In addition to the disks.list
runner, there is a
disks.report
runner that prints out a detailed report
of what will happen in the next DeepSea stage.3 invocation.
root@master #
salt-run disks.report
Deploy OSDs. On the next DeepSea stage.3 invocation, the OSD disks will be deployed according to your drive group specification.
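To trigger the deployment, run stage 3 in the same way as during the initial deployment:
root@master #
salt-run state.orch ceph.stage.3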
/srv/salt/ceph/configuration/files/drive_groups.yml
accepts the following options:
drive_group_default_name:
  target: *
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  db_devices:
    drive_spec: DEVICE_SPECIFICATION
  wal_devices:
    drive_spec: DEVICE_SPECIFICATION
  block_wal_size: '5G'  # (optional, unit suffixes permitted)
  block_db_size: '5G'   # (optional, unit suffixes permitted)
  osds_per_device: 1    # number of OSD daemons per device
  format:               # 'bluestore' or 'filestore' (defaults to 'bluestore')
  encryption:           # 'True' or 'False' (defaults to 'False')
For FileStore setups, drive_groups.yml
can be as
follows:
drive_group_default_name:
  target: *
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  journal_devices:
    drive_spec: DEVICE_SPECIFICATION
  format: filestore
  encryption: True
You can describe the specification using the following filters:
By a disk model:
model: DISK_MODEL_STRING
By a disk vendor:
vendor: DISK_VENDOR_STRING
Always lowercase the DISK_VENDOR_STRING.
Whether a disk is rotational or not. SSDs and NVMe devices are not rotational.
rotational: 0
Deploy a node using all available drives for OSDs:
data_devices:
  all: true
Additionally, by limiting the number of matching disks:
limit: 10
You can filter disk devices by their size—either by an exact size,
or a size range. The size:
parameter accepts arguments in
the following form:
'10G' - Includes disks of an exact size.
'10G:40G' - Includes disks whose size is within the range.
':10G' - Includes disks less than or equal to 10G in size.
'40G:' - Includes disks equal to or greater than 40G in size.
drive_group_default:
  target: '*'
  data_devices:
    size: '40TB:'
  db_devices:
    size: ':2TB'
When using the ':' delimiter, you need to enclose the size in quotes, otherwise the ':' sign will be interpreted as a new configuration hash.
Instead of (G)igabytes, you can also specify the sizes in (M)egabytes or (T)erabytes.
This section includes examples of different OSD setups.
This example describes two nodes with the same setup:
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4TB
2 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512GB
The corresponding drive_groups.yml
file will be as
follows:
drive_group_default:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: MC-55-44-ZX
Such a configuration is simple and valid. The problem is that an administrator may add disks from different vendors in the future, and these will not be included. You can improve it by reducing the filters to core properties of the drives:
drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
In the previous example, all rotating devices are declared as 'data devices', and all non-rotating devices will be used as 'shared devices' (wal, db).
If you know that drives with more than 2TB will always be the slower data devices, you can filter by size:
drive_group_default:
  target: '*'
  data_devices:
    size: '2TB:'
  db_devices:
    size: ':2TB'
This example describes two distinct setups: 20 HDDs should share 2 SSDs, while 10 SSDs should share 2 NVMes.
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4TB
12 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512GB
2 NVMEs
Vendor: Samsung
Model: NVME-QQQQ-987
Size: 256GB
Such setup can be defined with two layouts as follows:
drive_group:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    model: MC-55-44-ZX
drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-ZX
  db_devices:
    vendor: samsung
    size: 256GB
The previous examples assumed that all nodes have the same drives. However, that is not always the case:
Node 1-5:
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4TB
2 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512GB
Node 6-10:
5 NVMEs
Vendor: Intel
Model: SSD-123-foo
Size: 4TB
20 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512GB
You can use the 'target' key in the layout to target specific nodes. Salt target notation helps to keep things simple:
drive_group_node_one_to_five:
  target: 'node[1-5]'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
followed by
drive_group_the_rest:
  target: 'node[6-10]'
  data_devices:
    model: MC-55-44-ZX
  db_devices:
    model: SSD-123-foo
All previous cases assumed that the WALs and DBs use the same device. It is however possible to deploy the WAL on a dedicated device as well:
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4TB
2 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512GB
2 NVMEs
Vendor: Samsung
Model: NVME-QQQQ-987
Size: 256GB
drive_group_default:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: MC-55-44-ZX
  wal_devices:
    model: NVME-QQQQ-987
In the following setup, we are trying to define:
20 HDDs backed by 1 NVME
2 HDDs backed by 1 SSD(db) and 1 NVME(wal)
8 SSDs backed by 1 NVME
2 SSDs stand-alone (encrypted)
1 HDD is spare and should not be deployed
The summary of used drives follows:
23 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4TB
10 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512GB
1 NVMe
Vendor: Samsung
Model: NVME-QQQQ-987
Size: 256GB
The DriveGroups definition will be as follows:
drive_group_hdd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: NVME-QQQQ-987
drive_group_hdd_ssd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-ZX
  wal_devices:
    model: NVME-QQQQ-987
drive_group_ssd_nvme:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: NVME-QQQQ-987
drive_group_ssd_standalone_encrypted:
  target: '*'
  data_devices:
    model: SSD-123-foo
  encryption: True
One HDD will remain unassigned, because the file is parsed from top to bottom.
Adjusting ceph.conf with Custom Settings
If you need to put custom settings into the ceph.conf
configuration file, see Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.13 “Adjusting ceph.conf
with Custom Settings” for more
details.
This chapter introduces steps to upgrade SUSE Enterprise Storage 5.5 to version 6. Note that version 5.5 is basically 5 with all the latest patches applied.
Upgrading from a SUSE Enterprise Storage version older than 5.5 is not supported. You first need to upgrade to the latest version of SUSE Enterprise Storage 5.5 and then follow the steps in this chapter.
Read the release notes - there you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:
your hardware needs special considerations.
any used software packages have changed significantly.
special precautions are necessary for your installation.
The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.
After having installed the package release-notes-ses,
find the release notes locally in the directory
/usr/share/doc/release-notes
or online at
https://www.suse.com/releasenotes/.
In case you previously upgraded from version 4, verify that the upgrade to version 5 was completed successfully:
Check for the existence of the file
/srv/salt/ceph/configuration/files/ceph.conf.import
It is created by the engulf process during the upgrade from SES 4 to 5.
Also, the configuration_init: default-import
option is
set in the file
/srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml
If configuration_init
is still set to
default-import
, the cluster is using
ceph.conf.import
as its configuration file and not
DeepSea's default ceph.conf, which is compiled
from files in
/srv/salt/ceph/configuration/files/ceph.conf.d/
Therefore you need to inspect ceph.conf.import
for
any custom configuration, and possibly move the configuration to one of
the files in
/srv/salt/ceph/configuration/files/ceph.conf.d/
Then remove the configuration_init: default-import
line
from
/srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml
If you do not merge the configuration
from ceph.conf.import
and remove the
configuration_init: default-import
option, any default
configuration settings we ship as part of DeepSea (stored in
/srv/salt/ceph/configuration/files/ceph.conf.j2
)
will not be applied to the cluster.
Check if the cluster uses the new bucket type 'straw2':
cephadm >
ceph osd crush dump | grep straw
Check that Ceph 'jewel' profile is used:
cephadm >
ceph osd crush dump | grep profile
In case old RBD kernel clients (older than SUSE Linux Enterprise Server 12 SP3) are being used, refer to Book “Administration Guide”, Chapter 10 “RADOS Block Device”, Section 10.9 “Mapping RBD using Old Kernel Clients”. We recommend upgrading old RBD kernel clients if possible.
If openATTIC is located on the Admin Node, it will be unavailable after you upgrade the node. The new Ceph Dashboard will not be available until you deploy it by using DeepSea.
The cluster upgrade may take a long time—approximately the time it takes to upgrade one machine multiplied by the number of cluster nodes.
A single node cannot be upgraded while running the previous SUSE Linux Enterprise Server release, but needs to be rebooted into the new version's installer. Therefore the services that the node provides will be unavailable for some time. The core cluster services will still be available—for example if one MON is down during upgrade, there are still at least two active MONs. Unfortunately, single instance services, such as a single iSCSI Gateway, will be unavailable.
Certain types of daemons depend upon others. For example Ceph Object Gateways depend upon Ceph MON and OSD daemons. We recommend upgrading in this order:
Admin Node
Ceph Monitors/Ceph Managers
Metadata Servers
Ceph OSDs
Object Gateways
iSCSI Gateways
NFS Ganesha
Samba Gateways
If you used AppArmor in either 'complain' or 'enforce' mode, you need to set a
Salt pillar variable before upgrading. Because SUSE Linux Enterprise Server 15 SP1 ships with AppArmor by
default, AppArmor management was integrated into DeepSea stage.0. The
default behavior in SUSE Enterprise Storage 6 is to remove AppArmor and
related profiles. If you want to retain the behavior configured in
SUSE Enterprise Storage 5.5, verify that one of the following lines
is present in the /srv/pillar/ceph/stack/global.yml
file before starting the upgrade:
apparmor_init: default-enforce
or
apparmor_init: default-complain
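As an optional check, not part of the official procedure, you can refresh the pillar and query the value to confirm that the minions see it:
root@master #
salt '*' saltutil.refresh_pillar
root@master #
salt '*' pillar.get apparmor_init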
Since SUSE Enterprise Storage 6, MDS names starting with a digit are
no longer allowed and MDS daemons will refuse to start. You can check
whether your daemons have such names either by running the ceph
fs status
command, or by restarting an MDS and checking its logs
for the following message:
deprecation warning: MDS id 'mds.1mon1' is invalid and will be forbidden in a future version. MDS names may not start with a numeric digit.
If you see the above message, the MDS names will need to be migrated before attempting to upgrade to SUSE Enterprise Storage 6. DeepSea provides an orchestration to automate such migration. MDS names starting with a digit will be prepended with 'mds.':
root@master #
salt-run state.orch ceph.mds.migrate-numerical-names
If you have configuration settings that are bound to MDS names and your
MDS daemons have names starting with a digit, verify that your
configuration settings apply to the new names as well (with the 'mds.'
prefix). Consider the following example section in the
/etc/ceph/ceph.conf
file:
[mds.123-my-mds]
# config settings specific to an MDS with a name starting with a digit
mds cache memory limit = 1073741824
mds standby for name = 456-another-mds
The ceph.mds.migrate-numerical-names
orchestrator will
change the MDS daemon name '123-my-mds' to 'mds.123-my-mds'. You need to
adjust the configuration to reflect the new name:
[mds.mds.123-my-mds]
# config settings specific to the MDS with the new name
mds cache memory limit = 1073741824
mds standby for name = mds.456-another-mds
This will add MDS daemons with the new names before removing the old MDS daemons. The number of MDS daemons will double for a short time. Clients will be able to access CephFS with a short pause to failover. Therefore plan the migration for times when you expect little or no CephFS load.
Although creating backups of cluster's configuration and data is not mandatory, we strongly recommend backing up important configuration files and cluster data. Refer to Book “Administration Guide”, Chapter 2 “Backing Up Cluster Configuration and Data” for more details.
Migrate from ntpd to chronyd
SUSE Linux Enterprise Server 15 SP1 no longer uses ntpd to synchronize the local host time. Instead, chronyd is used. You need to migrate the time synchronization daemon on each cluster node. You can migrate to chronyd either before you upgrade the cluster, or upgrade the cluster first and migrate to chronyd afterward.
Migrate to chronyd before the Cluster Upgrade
Install the chrony package:
root@minion >
zypper install chrony
Edit the chronyd
configuration
file /etc/chrony.conf
and add NTP sources from the
current ntpd
configuration in
/etc/ntp.conf
.
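For example, if /etc/ntp.conf lists your site's time source on a server line, the corresponding entry in /etc/chrony.conf could look as follows (ntp1.example.com is a placeholder for your own NTP server):
# time source carried over from /etc/ntp.conf
server ntp1.example.com iburst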
Disable and stop the ntpd
service:
root@master #
systemctl disable ntpd.service && systemctl stop ntpd.service
Start and enable the chronyd
service:
root@master #
systemctl start chronyd.service && systemctl enable chronyd.service
Verify the status of chrony:
root@master #
chronyc tracking
Migrate to chronyd after the Cluster Upgrade
During the cluster upgrade, add the following software repositories:
SLE-Module-Legacy15-SP1-Pool
SLE-Module-Legacy15-SP1-Updates
Upgrade the cluster to version 6.
Edit the chronyd
configuration
file /etc/chrony.conf
and add NTP sources from the
current ntpd
configuration in
/etc/ntp.conf
.
chronyd
Configuration
Refer to
https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html
to find more details about how to include time sources in
chronyd
configuration.
Disable and stop the ntpd
service:
root@master #
systemctl disable ntpd.service && systemctl stop ntpd.service
Start and enable the chronyd
service:
root@master #
systemctl start chronyd.service && systemctl enable chronyd.service
Migrate from ntpd
to
chronyd
.
Verify the status of chrony:
root@master #
chronyc tracking
Remove the legacy software repositories that you added to keep
ntpd
in the system during the
upgrade process.
Apply the latest patches to all cluster nodes prior to upgrade.
Check that the required repositories are configured on each cluster node. To list all available repositories, run
root@minion >
zypper lr
SUSE Enterprise Storage 5.5 requires:
SLES12-SP3-Installer-Updates
SLES12-SP3-Pool
SLES12-SP3-Updates
SUSE-Enterprise-Storage-5-Pool
SUSE-Enterprise-Storage-5-Updates
NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 12 SP3 requires:
SLE-HA12-SP3-Pool
SLE-HA12-SP3-Updates
If you are using one of the repository staging systems (SMT, RMT, or SUSE Manager), create a new frozen patch level for the current and the new SUSE Enterprise Storage version.
Apply latest patches of SUSE Enterprise Storage 5.5 and SUSE Linux Enterprise Server 12 SP3 to each Ceph cluster node. Verify that correct software repositories are connected to each cluster node (see Section 5.4.1, “Required Software Repositories”) and run DeepSea stage.0:
root@master #
salt-run state.orch ceph.stage.0
After stage.0 completes, verify that each cluster node's status includes 'HEALTH_OK'. If not, resolve the problem before possible reboots in next steps.
Run zypper ps
to check for processes that may run with
outdated libraries or binaries, and reboot if there are some.
Verify that the running kernel is the latest available and reboot if not. Check outputs of the following commands:
cephadm >
uname -a
cephadm >
rpm -qa kernel-default
Verify that the ceph package is version 12.2.12 or newer. Verify that the deepsea package is version 0.8.9 or newer.
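One quick way to check the installed versions is to query the packages directly, for example:
cephadm >
rpm -q ceph
cephadm >
rpm -q deepsea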
If you previously used any of the bluestore_cache settings, they are no longer effective since Ceph version 12.2.10. The new setting bluestore_cache_autotune, which is set to 'true' by default, disables manual cache sizing. To turn on the old behavior, you need to set bluestore_cache_autotune=false. Refer to
Book “Administration Guide”, Chapter 14 “Ceph Cluster Configuration”, Section 14.2.1 “Automatic Cache Sizing” for details.
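If you decide to keep manual cache sizing, a minimal ceph.conf fragment could look similar to the following; the cache size is an illustrative value only:
[osd]
bluestore_cache_autotune = false
# example manual cache size in bytes
bluestore_cache_size = 1073741824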
If the system has obvious problems, fix them before starting the upgrade. Upgrade never fixes existing system problems.
Check cluster performance. You can use commands such as rados
bench
, ceph tell osd.* bench
, or
iperf3
.
Verify access to gateways (such as iSCSI Gateway, or Object Gateway) and RADOS Block Device.
Document specific parts of the system setup, such as network setup, partitioning, or installation details.
Use supportconfig
to collect important system
information and save it outside cluster nodes. Find more information in
https://www.suse.com/documentation/sles-12/book_sle_admin/data/sec_admsupport_supportconfig.html.
Ensure there is enough free disk space on each cluster node. Check free
disk space with df -h
. When needed, free disk space by
removing unneeded files/directories or removing obsolete OS snapshots. If
there is not enough free disk space, do not continue with the upgrade
unless you free enough disk space.
Check the cluster health before starting the upgrade procedure. Do not start the upgrade unless each cluster node reports 'HEALTH_OK'.
Verify that all services are running:
Salt master and Salt minion daemons.
Ceph Monitor and Ceph Manager daemons.
Metadata Server daemons.
Ceph OSD daemons.
Object Gateway daemons.
iSCSI Gateway daemons.
The following commands provide details of the cluster state and specific configuration:
ceph -s
Prints a brief summary of Ceph cluster health, running services, data usage, and IO statistics. Verify that it reports 'HEALTH_OK' before starting the upgrade.
ceph health detail
Prints details if Ceph cluster health is not OK.
ceph versions
Prints versions of running Ceph daemons.
ceph df
Prints total and free disk space on the cluster. Do not start the upgrade if the cluster's free disk space is less than 25% of the total disk space.
salt '*' cephprocesses.check results=true
Prints running Ceph processes and their PIDs sorted by Salt minion.
ceph osd dump | grep ^flags
Verify that the 'recovery_deletes' and 'purged_snapdirs' flags are present. If not, you can force a scrub on all placement groups by running the following command. Be aware that this forced scrub may have a negative impact on your Ceph clients' performance.
cephadm >
ceph pg dump pgs_brief | cut -d " " -f 1 | xargs -n1 ceph pg scrub
CTDB provides a clustered database used by Samba Gateways. The CTDB protocol is very simple and does not support clusters of nodes communicating with different protocol versions. Therefore CTDB nodes need to be taken offline prior to performing an upgrade.
To ensure the core cluster services are available during the upgrade, you need to upgrade the cluster nodes sequentially one by one.
After a node is upgraded, a number of packages will be in an 'orphaned' state without a parent repository. This happens because python3 related packages do not obsolete python2 packages.
Find more information about listing orphaned packages in https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_zypper.html#sec_zypper_softup_orphaned.
Reboot the node from the SUSE Linux Enterprise Server 15 SP1 installer DVD/image.
Select
from the boot menu.On the
screen, verify that 'SUSE Linux Enterprise Server 15 SP1' is selected and activate the check box.Select the following modules to install:
SUSE Enterprise Storage 6 x86_64
Basesystem Module 15 SP1 x86_64
Desktop Applications Module 15 SP1 x86_64
Legacy Module 15 SP1 x86_64
Server Applications Module 15 SP1 x86_64
On the
screen, verify that the correct repositories are selected. If the system is not registered with SCC/SMT, you need to add the repositories manually.SUSE Enterprise Storage 6 requires:
SLE-Module-Basesystem15-SP1-Pool
SLE-Module-Basesystem15-SP1-Updates
SLE-Module-Server-Applications15-SP1-Pool
SLE-Module-Server-Applications15-SP1-Updates
SLE-Module-Desktop-Applications15-SP1-Pool
SLE-Module-Desktop-Applications15-SP1-Updates
SLE-Product-SLES15-SP1-Pool
SLE-Product-SLES15-SP1-Updates
SLE15-SP1-Installer-Updates
SUSE-Enterprise-Storage-6-Pool
SUSE-Enterprise-Storage-6-Updates
If you intend to migrate ntpd
to
chronyd
after SES migration
(refer to Section 5.3, “Migrate from ntpd
to chronyd
”), include the following
repositories:
SLE-Module-Legacy15-SP1-Pool
SLE-Module-Legacy15-SP1-Updates
NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 15 SP1 requires:
SLE-Product-HA15-SP1-Pool
SLE-Product-HA15-SP1-Updates
Review the
and start the installation procedure by clicking .
The following commands will still work, although the Salt minions are running an old version of Ceph and Salt: salt '*' test.ping and ceph status
After the upgrade of the Admin Node, openATTIC will no longer be installed.
If the Admin Node hosted SMT, complete its migration to RMT (refer to https://www.suse.com/documentation/sles-15/book_rmt/data/cha_rmt_migrate.html).
Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
If your cluster does not use MDS roles, upgrade MON/MGR nodes one by one.
If your cluster uses MDS roles, and MON/MGR and MDS roles are co-located, you need to shrink the MDS cluster and then upgrade the co-located nodes. Refer to Section 5.11, “Upgrade Metadata Servers” for more details.
If your cluster uses MDS roles and they run on dedicated servers, upgrade all MON/MGR nodes one by one, then shrink the MDS cluster and upgrade it. Refer to Section 5.11, “Upgrade Metadata Servers” for more details.
Due to a limitation in the Ceph Monitor design, once two MONs have been upgraded to SUSE Enterprise Storage 6 and have formed a quorum, the third MON (while still on SUSE Enterprise Storage 5.5) will not rejoin the MON cluster if it restarted for any reason, including a node reboot. Therefore, when two MONs have been upgraded it is best to upgrade the rest as soon as possible.
Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
You need to shrink the Metadata Server (MDS) cluster. Because of incompatible features between the SUSE Enterprise Storage 5.5 and 6 versions, the older MDS daemons will shut down as soon as they see a single SES 6 level MDS join the cluster. Therefore it is necessary to shrink the MDS cluster to a single active MDS (and no standbys) for the duration of the MDS node upgrades. As soon as the second node is upgraded, you can extend the MDS cluster again.
On a heavily loaded MDS cluster, you may need to reduce the load (for example by stopping clients) so that a single active MDS is able to handle the workload.
Note the current value of the max_mds
option:
cephadm >
ceph fs get cephfs | grep max_mds
Shrink the MDS cluster if you have more than one active MDS daemon, that is, if max_mds is > 1. To shrink the MDS cluster, run
cephadm >
ceph fs set FS_NAME max_mds 1
where FS_NAME is the name of your CephFS instance ('cephfs' by default).
Find the node hosting one of standby MDS daemons. Consult the output of
the ceph fs status
command and start the upgrade of the
MDS cluster on this node.
cephadm >
ceph fs status
cephfs - 2 clients
======
+------+--------+--------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+--------+---------------+-------+-------+
| 0 | active | mon1-6 | Reqs: 0 /s | 13 | 16 |
+------+--------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
| Pool | type | used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 2688k | 96.8G |
| cephfs_data | data | 0 | 96.8G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| mon3-6 |
| mon2-6 |
+-------------+
In this example, you need to start the upgrade procedure either on node 'mon3-6' or 'mon2-6'.
Upgrade the node with the standby MDS daemon. After the upgraded MDS node starts, the outdated MDS daemons will shut down automatically. At this point, clients may experience a short downtime of the CephFS service.
Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
Upgrade the remaining MDS nodes.
Reset max_mds
to the desired configuration:
root@master #
ceph fs set FS_NAME max_mds ACTIVE_MDS_COUNT
For each storage node, follow these steps:
Identify which OSD daemons are running on a particular node:
cephadm >
ceph osd tree
Set the 'noout' flag for each OSD daemon on the node that is being upgraded:
cephadm >
ceph osd add-noout osd.OSD_ID
For example:
cephadm >
for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd add-noout osd.$i; done
Verify with:
cephadm >
ceph health detail | grep noout
or
cephadm >
ceph -s
cluster:
id: 44442296-033b-3275-a803-345337dc53da
health: HEALTH_WARN
6 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
Create /etc/ceph/osd/*.json files for all existing OSDs by running the following command on the node that is going to be upgraded:
cephadm >
ceph-volume simple scan --force
Upgrade the OSD node. Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
Activate all OSDs found in the system:
cephadm >
ceph-volume simple activate --all
If you want to activate data partitions individually, you need to find
the correct ceph-volume
command for each partition to
activate it. Replace X1 with the partition's
correct letter/number:
cephadm >
ceph-volume simple scan /dev/sdX1
For example:
cephadm >
ceph-volume simple scan /dev/vdb1
[...]
--> OSD 8 got scanned and metadata persisted to file:
/etc/ceph/osd/8-d7bd2685-5b92-4074-8161-30d146cd0290.json
--> To take over management of this scanned OSD, and disable ceph-disk
and udev, run:
--> ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290
The last line of the output contains the command to activate the partition:
cephadm >
ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290
[...]
--> All ceph-disk systemd units have been disabled to prevent OSDs
getting triggered by UDEV events
[...]
Running command: /bin/systemctl start ceph-osd@8
--> Successfully activated OSD 8 with FSID
d7bd2685-5b92-4074-8161-30d146cd0290
Verify that the OSD node will start properly after the reboot.
Address the 'Legacy BlueStore stats reporting detected on XX OSD(s)' message:
cephadm >
ceph -s
cluster:
id: 44442296-033b-3275-a803-345337dc53da
health: HEALTH_WARN
Legacy BlueStore stats reporting detected on 6 OSD(s)
The warning is normal when upgrading Ceph to 14.2.2. You can disable it by setting:
bluestore_warn_on_legacy_statfs = false
The proper fix is to run the following command on all OSDs while they are stopped:
cephadm >
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-XXX
Following is a helper script that runs the ceph-bluestore-tool
repair
for all OSDs on the NODE_NAME
node:
OSDNODE=OSD_NODE_NAME
for OSD in $(ceph osd ls-tree $OSDNODE); do
  echo "osd=$OSD"
  salt $OSDNODE cmd.run "systemctl stop ceph-osd@$OSD"
  salt $OSDNODE cmd.run "ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$OSD"
  salt $OSDNODE cmd.run "systemctl start ceph-osd@$OSD"
done
Unset the 'noout' flag for each OSD daemon on the node that is upgraded:
cephadm >
ceph osd rm-noout osd.OSD_ID
For example:
cephadm >
for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd rm-noout osd.$i; done
Verify with:
cephadm >
ceph health detail | grep noout
Note:
cephadm >
ceph -s
cluster:
id: 44442296-033b-3275-a803-345337dc53da
health: HEALTH_WARN
Legacy BlueStore stats reporting detected on 6 OSD(s)
Verify the cluster status. It will be similar to the following output:
cephadm >
ceph status
cluster:
id: e0d53d64-6812-3dfe-8b72-fd454a6dcf12
health: HEALTH_WARN
3 monitors have not enabled msgr2
services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
mgr: mon2(active, since 22m), standbys: mon1, mon3
osd: 30 osds: 30 up, 30 in
data:
pools: 1 pools, 1024 pgs
objects: 0 objects, 0 B
usage: 31 GiB used, 566 GiB / 597 GiB avail
pgs: 1024 active+clean
When you need to find out the versions of individual cluster components and nodes—for example to find out if all your nodes are actually on the same patch level after the upgrade—you can run
root@master #
salt-run status.report
The command goes through the connected Salt minions and scans for the version numbers of Ceph, Salt, and SUSE Linux Enterprise Server, and gives you a report displaying the version that the majority of nodes have and showing nodes whose version is different from the majority.
Verify that all OSD nodes were rebooted and that OSDs started automatically after the reboot.
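One simple way to verify this, assuming you check each OSD node individually, is to compare the cluster's OSD tree with the systemd units on the node:
cephadm >
ceph osd tree
root@minion >
systemctl status ceph-osd.target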
OSD BlueStore is a new back end for the OSD daemons. It is the default option since SUSE Enterprise Storage 5. Compared to FileStore, which stores objects as files in an XFS file system, BlueStore can deliver increased performance because it stores objects directly on the underlying block device. BlueStore also enables other features, such as built-in compression and EC overwrites, that are unavailable with FileStore.
Specifically for BlueStore, an OSD has a 'wal' (Write Ahead Log) device and a 'db' (RocksDB database) device. The RocksDB database holds the metadata for a BlueStore OSD. These two devices will reside on the same device as an OSD by default, but either can be placed on different, for example faster, media.
In SUSE Enterprise Storage 5, both FileStore and BlueStore are supported and it is possible for FileStore and BlueStore OSDs to co-exist in a single cluster. During the SUSE Enterprise Storage upgrade procedure, FileStore OSDs are not automatically converted to BlueStore. Be aware that the BlueStore-specific features will not be available on OSDs that have not been migrated to BlueStore.
Before converting to BlueStore, the OSDs need to be running SUSE Enterprise Storage 5. The conversion is a slow process as all data gets re-written twice. Though the migration process can take a long time to complete, there is no cluster outage and all clients can continue accessing the cluster during this period. However, do expect lower performance for the duration of the migration. This is caused by rebalancing and backfilling of cluster data.
Use the following procedure to migrate FileStore OSDs to BlueStore:
Salt commands needed for running the migration are blocked by safety measures. In order to turn these precautions off, run the following command:
root@master #
salt-run disengage.safety
Rebuild the nodes before continuing:
root@master #
salt-run rebuild.node TARGET
You can also choose to rebuild each node individually. For example:
root@master #
salt-run rebuild.node data1.ceph
The rebuild.node runner always removes and re-creates all OSDs on the node.
If one OSD fails to convert, re-running the rebuild destroys the already converted BlueStore OSDs. Instead of re-running the rebuild, you can run:
root@master #
salt-run disks.deploy TARGET
After the migration to BlueStore, the object count will remain the same and disk usage will be nearly the same.
Upgrade application nodes in the following order:
Object Gateways
If the Object Gateways are fronted by a load balancer, then a rolling upgrade of the Object Gateways should be possible without an outage.
Validate that the Object Gateway daemons are running after each upgrade, and test with S3/Swift client.
Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
iSCSI Gateways
If iSCSI initiators are configured with multipath, then a rolling upgrade of the iSCSI Gateways should be possible without an outage.
Validate that the lrbd
daemon is
running after each upgrade, and test with initiator.
Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
NFS Ganesha. Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
Samba Gateways. Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.
Update policy.cfg and Deploy Ceph Dashboard using DeepSea
On the Admin Node, edit
/srv/pillar/ceph/proposals/policy.cfg
and apply the
following changes:
During cluster upgrade, do not add new services to the
policy.cfg
file. Change the cluster architecture only
after the upgrade is completed.
Remove role-openattic
.
Add role-prometheus
and role-grafana
to the node that had Prometheus and Grafana installed, usually the
Admin Node.
The role profile-PROFILE_NAME is now ignored. Add a new, corresponding role-storage line. For example, for an existing
profile-default/cluster/*.sls
add
role-storage/cluster/*.sls
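As an illustration, the storage-related part of policy.cfg would change as follows (profile-default is an example profile name):
# SUSE Enterprise Storage 5.5 (now ignored)
profile-default/cluster/*.sls

# SUSE Enterprise Storage 6
role-storage/cluster/*.sls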
Synchronize all Salt modules:
root@master #
salt '*' saltutil.sync_all
Update the Salt pillar by running DeepSea stage.1 and stage.2:
root@master #
salt-run state.orch ceph.stage.1
root@master #
salt-run state.orch ceph.stage.2
Cleanup openATTIC:
root@master #
salt OA_MINION state.apply ceph.rescind.openattic
root@master #
salt OA_MINION state.apply ceph.remove.openattic
Unset the 'restart_igw' grain to prevent stage.0 from restarting the iSCSI Gateway, which is not installed yet:
root@master #
salt '*' grains.delkey restart_igw
Finally, run through DeepSea stages 0-4:
root@master #
salt-run state.orch ceph.stage.0
root@master #
salt-run state.orch ceph.stage.1
root@master #
salt-run state.orch ceph.stage.2
root@master #
salt-run state.orch ceph.stage.3
root@master #
salt-run state.orch ceph.stage.4
DeepSea stage.3 may fail with an error similar to the following:
subvolume : ['/var/lib/ceph subvolume missing on 4510-2', \
             '/var/lib/ceph subvolume missing on 4510-1', \
             [...]
             'See /srv/salt/ceph/subvolume/README.md']
In this case, you need to edit
/srv/pillar/ceph/stack/global.yml
and
add the following line:
subvolume_init: disabled
Then refresh the Salt pillar and re-run DeepSea stage.3:
root@master #
salt '*' saltutil.refresh_pillar
root@master #
salt-run state.orch ceph.stage.3
After DeepSea successfully finished stage.3, the Ceph Dashboard will be running. Refer to Book “Administration Guide”, Chapter 20 “Ceph Dashboard” for a detailed overview of Ceph Dashboard features.
To list the node running the dashboard, run:
cephadm >
ceph mgr services | grep dashboard
To list admin credentials run:
root@master #
salt-call grains.get dashboard_creds
Sequentially restart the Object Gateway services to use 'beast' Web server instead of the outdated 'civetweb':
root@master #
salt-run state.orch ceph.restart.rgw.force
Before you continue, we strongly recommend enabling the Ceph telemetry module. For information and instructions, see Book “Administration Guide”, Chapter 8 “Ceph Manager Modules”, Section 8.2 “Telemetry Module”.
In SUSE Enterprise Storage 5.5, DeepSea offered so called 'profiles' to describe the layout of your OSDs. Starting with SUSE Enterprise Storage 6, we moved to a different approach called DriveGroups (find more details in Section 4.5.2, “DriveGroups”).
Migrating to the new approach is not immediately mandatory. Destructive
operations, such as salt-run osd.remove
,
salt-run osd.replace
, or salt-run
osd.purge
are still available. However, adding new OSDs will
require your action.
Because of the different approach of these implementations, we do not offer an automated migration path. However, we offer a variety of tools—Salt runners—to make the migration as simple as possible.
To view information about the currently deployed OSDs, use the following command:
root@master #
salt-run disks.discover
Alternatively, you can inspect the content of the files in the /srv/pillar/ceph/proposals/profile-*/ directories. They have a structure similar to the following:
ceph:
  storage:
    osds:
      /dev/disk/by-id/scsi-drive_name:
        format: bluestore
      /dev/disk/by-id/scsi-drive_name2:
        format: bluestore
Refer to Section 4.5.2.1, “Specification” for more details on DriveGroups specification.
The difference between a fresh deployment and an upgrade scenario is that the drives to be migrated are already 'used'. Because
root@master #
salt-run disks.list
looks for unused disks only, use
root@master #
salt-run disks.list include_unavailable=True
Adjust DriveGroups until you match your current setup. For a more visual representation of what will be happening, use the following command. Note that it has no output if there are no free disks:
root@master #
salt-run disks.report bypass_pillar=True
If you verified that your DriveGroups are properly configured and want to apply the new approach, remove the files from the /srv/pillar/ceph/proposals/profile-PROFILE_NAME/ directory, remove the corresponding profile-PROFILE_NAME/cluster/*.sls lines from the /srv/pillar/ceph/proposals/policy.cfg file, and run DeepSea stage.2 to refresh the Salt pillar.
root@master #
salt-run state.orch ceph.stage.2
Verify the result by running the following commands:
root@master #
salt target_node pillar.get ceph:storage
root@master #
salt-run disks.report
If your DriveGroups are not properly configured and there are spare disks in your setup, they will be deployed in the way you specified them. We recommend running:
root@master #
salt-run disks.report
For simple cases such as standalone OSDs, the migration will happen over time. Whenever you remove or replace an OSD from the cluster, it will be replaced by a new, LVM-based OSD.
Whenever a single 'legacy' OSD needs to be replaced on a node, all OSDs that share devices with it need to be migrated to the LVM-based format.
For completeness, consider migrating OSDs on the whole node.
If you have a more sophisticated setup than just standalone OSDs, for example dedicated WAL/DB devices or encrypted OSDs, the migration can only happen when all OSDs assigned to that WAL/DB device are removed. This is because ceph-volume creates Logical Volumes on the disks before deployment, which prevents the user from mixing partition-based deployments with LV-based deployments. In such cases, it is best to manually remove all OSDs that are assigned to a WAL/DB device and re-deploy them using the DriveGroups approach.
You can change the default cluster configuration generated in Stage 2 (refer
to DeepSea Stages Description). For example, you may need to
change network settings, or software that is installed on the Admin Node by
default. You can perform the former by modifying the pillar updated after
Stage 2, while the latter is usually done by creating a custom
sls
file and adding it to the pillar. Details are described in the following sections.
This section lists several tasks that require adding/changing your own
sls
files. Such a procedure is typically used when you
need to change the default deployment process.
Your custom .sls files belong to the same subdirectory as DeepSea's .sls
files. To prevent overwriting your .sls files with the possibly newly added
ones from the DeepSea package, prefix their name with the
custom-
string.
If you address a specific task outside of the DeepSea deployment process and therefore need to skip it, create a 'no-operation' file following this example:
Create /srv/salt/ceph/time/disabled.sls
with the
following content and save it:
disable time setting: test.nop
Edit /srv/pillar/ceph/stack/global.yml
, add the
following line, and save it:
time_init: disabled
Verify by refreshing the pillar and running the step:
root@master #
salt target saltutil.pillar_refresh
root@master #
salt 'admin.ceph' state.apply ceph.time
admin.ceph:
  Name: disable time setting - Function: test.nop - Result: Clean

Summary for admin.ceph
------------
Succeeded: 1
Failed:    0
------------
Total states run: 1
The task ID 'disable time setting' may be any message unique within an
sls
file. Prevent ID collisions by specifying unique
descriptions.
If you need to replace the default behavior of a specific step with a
custom one, create a custom sls
file with replacement
content.
By default /srv/salt/ceph/pool/default.sls
creates an
rbd image called 'demo'. In our example, we do not want this image to be
created, but we need two images: 'archive1' and 'archive2'.
Create /srv/salt/ceph/pool/custom.sls
with the
following content and save it:
wait:
  module.run:
    - name: wait.out
    - kwargs:
        'status': "HEALTH_ERR" 1
    - fire_event: True
archive1:
  cmd.run:
    - name: "rbd -p rbd create archive1 --size=1024" 2
    - unless: "rbd -p rbd ls | grep -q archive1$"
    - fire_event: True
archive2:
  cmd.run:
    - name: "rbd -p rbd create archive2 --size=768"
    - unless: "rbd -p rbd ls | grep -q archive2$"
    - fire_event: True
The wait module will pause until the Ceph cluster does not have a status of HEALTH_ERR.
The rbd create commands are guarded by unless conditions, so each image is only created if it does not already exist.
To call the newly created custom file instead of the default, you need to
edit /srv/pillar/ceph/stack/ceph/cluster.yml
, add
the following line, and save it:
pool_init: custom
Verify by refreshing the pillar and running the step:
root@master #
salt target saltutil.pillar_refresh
root@master #
salt 'admin.ceph' state.apply ceph.pool
The creation of pools or images requires sufficient authorization. The
admin.ceph
minion has an admin keyring.
Another option is to change the variable in
/srv/pillar/ceph/stack/ceph/roles/master.yml
instead.
Using this file will reduce the clutter of pillar data for other minions.
Sometimes you may need a specific step to do some additional tasks. We do not recommend modifying the related state file as it may complicate a future upgrade. Instead, create a separate file to carry out the additional tasks identical to what was described in Section 6.1.2, “Replacing a Deployment Step”.
Name the new sls
file descriptively. For example, if you
need to create two rbd images in addition to the demo image, name the file
archive.sls
.
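For example, /srv/salt/ceph/pool/archive.sls could contain just the two image definitions shown earlier, without the demo image (the image names and sizes are the same illustrative values):
archive1:
  cmd.run:
    - name: "rbd -p rbd create archive1 --size=1024"
    - unless: "rbd -p rbd ls | grep -q archive1$"
    - fire_event: True
archive2:
  cmd.run:
    - name: "rbd -p rbd create archive2 --size=768"
    - unless: "rbd -p rbd ls | grep -q archive2$"
    - fire_event: True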
Create /srv/salt/ceph/pool/custom.sls
with the
following content and save it:
include:
  - .archive
  - .default
In this example, Salt will create the archive
images and then create the demo image. The order
does not matter in this example. To change the order, reverse the lines
after the include:
directive.
You can add the include line directly to
archive.sls
and all the images will get created as
well. However, regardless of where the include line is placed, Salt
processes the steps in the included file first. Although this behavior
can be overridden with requires and
order statements, a separate file that includes the
others guarantees the order and reduces the chances of confusion.
Edit /srv/pillar/ceph/stack/ceph/cluster.yml
, add
the following line, and save it:
pool_init: custom
Verify by refreshing the pillar and running the step:
root@master #
salt target saltutil.pillar_refresh
root@master #
salt 'admin.ceph' state.apply ceph.pool
If you need to add a completely separate deployment step, create three new
files—an sls
file that performs the command, an
orchestration file, and a custom file which aligns the new step with the
original deployment steps.
For example, if you need to run logrotate
on all minions
as part of the preparation stage:
First create an sls
file and include the
logrotate
command.
Run logrotate on all Salt minions
Create a directory such as /srv/salt/ceph/logrotate
.
Create /srv/salt/ceph/logrotate/init.sls
with the
following content and save it:
rotate logs:
  cmd.run:
    - name: "/usr/sbin/logrotate /etc/logrotate.conf"
Verify that the command works on a minion:
root@master #
salt 'admin.ceph' state.apply ceph.logrotate
Because the orchestration file needs to run before all other preparation steps, add it to the Prep stage 0:
Create /srv/salt/ceph/stage/prep/logrotate.sls
with
the following content and save it:
logrotate:
  salt.state:
    - tgt: '*'
    - sls: ceph.logrotate
Verify that the orchestration file works:
root@master #
salt-run state.orch ceph.stage.prep.logrotate
The last file is the custom one which includes the additional step with the original steps:
Create /srv/salt/ceph/stage/prep/custom.sls
with the
following content and save it:
include:
  - .logrotate
  - .master
  - .minion
Override the default behavior. Edit
/srv/pillar/ceph/stack/global.yml
, add the following
line, and save the file:
stage_prep: custom
Verify that Stage 0 works:
root@master #
salt-run state.orch ceph.stage.0
Why global.yml?
The global.yml file is chosen over cluster.yml because during the prep stage, no minion yet belongs to the Ceph cluster and therefore has no access to any settings in cluster.yml.
During Stage 0 (refer to DeepSea Stages Description for more information on DeepSea stages), the Salt master and Salt minions may optionally reboot because newly updated packages, for example kernel, require rebooting the system.
The default behavior is to install available new updates and not reboot the nodes even in case of kernel updates.
You can change the default update/reboot behavior of DeepSea Stage 0 by
adding/changing the stage_prep_master
and
stage_prep_minion
options in the
/srv/pillar/ceph/stack/global.yml
file.
stage_prep_master
sets the behavior of the Salt master, and
stage_prep_minion
sets the behavior of all minions. All
available parameters are:
default: Install updates without rebooting.
default-update-reboot: Install updates and reboot after updating.
default-no-update-reboot: Reboot without installing updates.
default-no-update-no-reboot: Do not install updates or reboot.
For example, to prevent the cluster nodes from installing updates and
rebooting, edit /srv/pillar/ceph/stack/global.yml
and
add the following lines:
stage_prep_master: default-no-update-no-reboot
stage_prep_minion: default-no-update-no-reboot
The values of stage_prep_master
correspond to file names
located in /srv/salt/ceph/stage/0/master
, while
values of stage_prep_minion
correspond to files in
/srv/salt/ceph/stage/0/minion
:
cephadm >
ls -l /srv/salt/ceph/stage/0/master
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]
cephadm >
ls -l /srv/salt/ceph/stage/0/minion
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]
After you completed Stage 2, you may want to change the discovered configuration. To view the current settings, run:
root@master #
salt target pillar.items
The output of the default configuration for a single minion is usually similar to the following:
----------
available_roles:
    - admin
    - mon
    - storage
    - mds
    - igw
    - rgw
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - mds-nfs
    - rgw-nfs
    - master
cluster:
    ceph
cluster_network:
    172.16.22.0/24
fsid:
    e08ec63c-8268-3f04-bcdb-614921e94342
master_minion:
    admin.ceph
mon_host:
    - 172.16.21.13
    - 172.16.21.11
    - 172.16.21.12
mon_initial_members:
    - mon3
    - mon1
    - mon2
public_address:
    172.16.21.11
public_network:
    172.16.21.0/24
roles:
    - admin
    - mon
    - mds
time_server:
    admin.ceph
time_service:
    ntp
The above-mentioned settings are distributed across several configuration files. The directory structure with these files is defined in the /srv/pillar/ceph/stack/stack.cfg file. The following files usually describe your cluster:
/srv/pillar/ceph/stack/global.yml
- the file affects
all minions in the Salt cluster.
/srv/pillar/ceph/stack/ceph/cluster.yml
- the file affects all minions in the Ceph cluster called
ceph
.
/srv/pillar/ceph/stack/ceph/roles/role.yml
- affects all minions that are assigned the specific role in the
ceph
cluster.
/srv/pillar/ceph/stack/ceph/minions/MINION_ID.yml
- affects the individual minion.
There is a parallel directory tree that stores the default configuration
setup in /srv/pillar/ceph/stack/default
. Do not change
values here, as they are overwritten.
The typical procedure for changing the collected configuration is the following:
Find the location of the configuration item you need to change. For example, if you need to change a cluster-related setting such as the cluster network, edit the file /srv/pillar/ceph/stack/ceph/cluster.yml (see the example after this procedure).
Save the file.
Verify the changes by running:
root@master #
salt target saltutil.refresh_pillar
and then
root@master #
salt target pillar.items
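For example, a minimal sketch of changing the cluster network and pushing the new value to the minions; the subnet and the sed edit are only an illustration:

# change the cluster_network value in the cluster-wide file
root@master # sed -i 's|^cluster_network:.*|cluster_network: 172.16.23.0/24|' \
  /srv/pillar/ceph/stack/ceph/cluster.yml
# refresh the pillar on all minions and verify the new value
root@master # salt '*' saltutil.refresh_pillar
root@master # salt '*' pillar.get cluster_network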
Since IPv4 network addressing is prevalent, you need to enable IPv6 as a customization. DeepSea does not auto-discover IPv6 addressing.
To configure IPv6, set the public_network
and
cluster_network
variables in the
/srv/pillar/ceph/stack/global.yml
file to valid IPv6
subnets. For example:
public_network: fd00:10::/64
cluster_network: fd00:11::/64
Then run DeepSea Stage 2 and verify that the network information matches
the setting. Stage 3 will generate the ceph.conf
with
the necessary flags.
Ceph does not support dual stack—running Ceph simultaneously on
IPv4 and IPv6 is not possible. DeepSea validation will reject a mismatch
between public_network
and
cluster_network
or within either variable. The following
example will fail the validation.
public_network: "192.168.10.0/24 fd00:10::/64"
fe80::/10 Link-local Addresses
Avoid using fe80::/10 link-local
addresses. All network
interfaces have an assigned fe80
address and require an
interface qualifier for proper routing. Either assign IPv6 addresses
allocated to your site or consider using fd00::/8
.
These are part of ULA and not globally routable.
After you deploy your SUSE Enterprise Storage 6 cluster you may need to install additional software for accessing your data, such as the Object Gateway or the iSCSI Gateway, or you can deploy a clustered file system on top of the Ceph cluster. This chapter mainly focuses on manual installation. If you have a cluster deployed using Salt, refer to Chapter 4, Deploying with DeepSea/Salt for a procedure on installing particular gateways or the CephFS.
Ceph Object Gateway is an object storage interface built on top of
librgw
to provide applications with a RESTful gateway to
Ceph clusters. It supports two interfaces:
S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.
Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.
The Object Gateway daemon uses the 'Beast' HTTP front-end by default. It uses the Boost.Beast library for HTTP parsing and the Boost.Asio library for asynchronous network I/O operations.
Because Object Gateway provides interfaces compatible with OpenStack Swift and Amazon S3, the Object Gateway has its own user management. Object Gateway can store data in the same cluster that is used to store data from CephFS clients or RADOS Block Device clients. The S3 and Swift APIs share a common name space, so you may write data with one API and retrieve it with the other.
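For example, before S3 or Swift clients can authenticate, you typically create an Object Gateway user with the radosgw-admin tool; the user ID, display name, and subuser below are examples only:

# create an S3 user; the command prints the generated access and secret keys
root # radosgw-admin user create --uid=example-user --display-name="Example User"
# optionally add a Swift subuser for the same account
root # radosgw-admin subuser create --uid=example-user \
  --subuser=example-user:swift --access=full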
Object Gateway is installed as a DeepSea role, therefore you do not need to install it manually.
To install the Object Gateway during the cluster deployment, see Section 4.3, “Cluster Deployment”.
To add a new node with Object Gateway to the cluster, see Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.2 “Adding New Roles to Nodes”.
Install Object Gateway on a node that is not using port 80. The following command installs all required components:
root #
zypper ref && zypper in ceph-radosgw
If the Apache server from the previous Object Gateway instance is running, stop it and disable the relevant service:
root # systemctl stop apache2.service
root # systemctl disable apache2.service
Edit /etc/ceph/ceph.conf
and add the following lines:
[client.rgw.gateway_host]
rgw frontends = "beast port=80"
If you want to configure Object Gateway/Beast for use with SSL encryption, modify the line accordingly:
rgw frontends = beast ssl_port=7480 ssl_certificate=PATH_TO_CERTIFICATE.PEM
Restart the Object Gateway service.
root #
systemctl restart ceph-radosgw@rgw.gateway_host
Several steps are required to configure an Object Gateway.
Configuring a Ceph Object Gateway requires a running Ceph Storage Cluster. The Ceph Object Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:
A host name for the gateway instance, for example
gateway
.
A storage cluster user name with appropriate permissions and a keyring.
Pools to store its data.
A data directory for the gateway instance.
An instance entry in the Ceph configuration file.
Each instance must have a user name and key to communicate with a Ceph storage cluster. In the following steps, we use a monitor node to create a bootstrap keyring, then create the Object Gateway instance user keyring based on the bootstrap one. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.
Create a keyring for the gateway:
root # ceph-authtool --create-keyring /etc/ceph/ceph.client.rgw.keyring
root # chmod +r /etc/ceph/ceph.client.rgw.keyring
Generate a Ceph Object Gateway user name and key for each instance. As an example, we will use the name gateway after client.rgw:
root #
ceph-authtool /etc/ceph/ceph.client.rgw.keyring \
-n client.rgw.gateway --gen-key
Add capabilities to the key:
root #
ceph-authtool -n client.rgw.gateway --cap osd 'allow rwx' \
--cap mon 'allow rwx' /etc/ceph/ceph.client.rgw.keyring
Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:
root #
ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.gateway \
-i /etc/ceph/ceph.client.rgw.keyring
Distribute the keyring to the node with the gateway instance:
root # scp /etc/ceph/ceph.client.rgw.keyring ceph@HOST_NAME:/home/ceph
cephadm > ssh HOST_NAME
root # mv ceph.client.rgw.keyring /etc/ceph/ceph.client.rgw.keyring
An alternative way is to create the Object Gateway bootstrap keyring, and then create the Object Gateway keyring from it:
Create an Object Gateway bootstrap keyring on one of the monitor nodes:
root #
ceph \
auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
--connect-timeout=25 \
--cluster=ceph \
--name mon. \
--keyring=/var/lib/ceph/mon/ceph-NODE_HOST/keyring \
-o /var/lib/ceph/bootstrap-rgw/keyring
Create the
/var/lib/ceph/radosgw/ceph-RGW_NAME
directory for storing the Object Gateway keyring:
root #
mkdir \
/var/lib/ceph/radosgw/ceph-RGW_NAME
Create an Object Gateway keyring from the newly created bootstrap keyring:
root #
ceph \
auth get-or-create client.rgw.RGW_NAME osd 'allow rwx' mon 'allow rw' \
--connect-timeout=25 \
--cluster=ceph \
--name client.bootstrap-rgw \
--keyring=/var/lib/ceph/bootstrap-rgw/keyring \
-o /var/lib/ceph/radosgw/ceph-RGW_NAME/keyring
Copy the Object Gateway keyring to the Object Gateway host:
root #
scp \
/var/lib/ceph/radosgw/ceph-RGW_NAME/keyring \
RGW_HOST:/var/lib/ceph/radosgw/ceph-RGW_NAME/keyring
Ceph Object Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file.
The pool names follow the
ZONE_NAME.POOL_NAME
syntax. When configuring a gateway with the default region and zone, the
default zone name is 'default' as in our example:
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data
To create the pools manually, see Book “Administration Guide”, Chapter 9 “Managing Storage Pools”, Section 9.2.2 “Create a Pool”.
Only the default.rgw.buckets.data
pool can be
erasure-coded. All other pools need to be replicated, otherwise the
gateway is not accessible.
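If you create the pools manually, a minimal sketch could look as follows; the placement group counts are illustrative, and only the bucket data pool uses erasure coding:

root # ceph osd pool create .rgw.root 8
root # ceph osd pool create default.rgw.control 8
root # ceph osd pool create default.rgw.meta 8
root # ceph osd pool create default.rgw.log 8
root # ceph osd pool create default.rgw.buckets.index 64
root # ceph osd pool create default.rgw.buckets.data 128 128 erasure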
Add the Ceph Object Gateway configuration to the Ceph Configuration file. The Ceph Object Gateway configuration requires you to identify the Ceph Object Gateway instance. Then, specify the host name where you installed the Ceph Object Gateway daemon, a keyring (for use with cephx), and optionally a log file. For example:
[client.rgw.INSTANCE_NAME]
host = HOST_NAME
keyring = /etc/ceph/ceph.client.rgw.keyring
To override the default Object Gateway log file, include the following:
log file = /var/log/radosgw/client.rgw.INSTANCE_NAME.log
The [client.rgw.*]
portion of the gateway instance
identifies this portion of the Ceph configuration file as configuring a
Ceph Storage Cluster client where the client type is a Ceph Object Gateway
(radosgw). The instance name follows. For example:
[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring
The HOST_NAME must be your machine host name, excluding the domain name.
Then turn off print continue
. If you have it set to
true, you may encounter problems with PUT operations:
rgw print continue = false
To use a Ceph Object Gateway with subdomain S3 calls (for example
http://bucketname.hostname
), you must add the Ceph
Object Gateway DNS name under the [client.rgw.gateway]
section
of the Ceph configuration file:
[client.rgw.gateway]
...
rgw dns name = HOST_NAME
You should also consider installing a DNS server such as Dnsmasq on your
client machine(s) when using the
http://BUCKET_NAME.HOST_NAME
syntax. The dnsmasq.conf
file should include the
following settings:
address=/HOST_NAME/HOST_IP_ADDRESS
listen-address=CLIENT_LOOPBACK_IP
Then, add the CLIENT_LOOPBACK_IP IP address as the first DNS server on the client machine(s).
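Putting the above settings together, a complete gateway section in ceph.conf might look like the following sketch; the host name, log file path, and DNS name are examples only:

[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring
log file = /var/log/radosgw/client.rgw.gateway.log
rgw dns name = gateway.example.com
rgw print continue = false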
Deployment scripts may not create the default Ceph Object Gateway data directory.
Create data directories for each instance of a radosgw daemon if not
already done. The host
variables in the Ceph
configuration file determine which host runs each instance of a radosgw
daemon. The typical form specifies the radosgw daemon, the cluster name,
and the daemon ID.
root #
mkdir -p /var/lib/ceph/radosgw/CLUSTER_ID
Using the example ceph.conf
settings above, you would
execute the following:
root #
mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.gateway
To ensure that all components have reloaded their configurations, we
recommend restarting your Ceph Storage Cluster service. Then, start up
the radosgw
service. For more information, see
Book “Administration Guide”, Chapter 3 “Introduction” and
Book “Administration Guide”, Chapter 15 “Ceph Object Gateway”, Section 15.3 “Operating the Object Gateway Service”.
When the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:
<ListAllMyBucketsResult>
  <Owner>
    <ID>anonymous</ID>
    <DisplayName/>
  </Owner>
  <Buckets/>
</ListAllMyBucketsResult>
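For example, assuming the gateway host is reachable as gateway.example.com on port 80, you can issue the anonymous request with curl:

# an anonymous GET should return the empty ListAllMyBucketsResult document shown above
root # curl http://gateway.example.com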
iSCSI is a storage area network (SAN) protocol that allows clients (called
initiators) to send SCSI commands to SCSI storage
devices (targets) on remote servers. SUSE Enterprise Storage
6 includes a facility that opens Ceph storage management to
heterogeneous clients, such as Microsoft Windows* and VMware* vSphere, through the
iSCSI protocol. Multipath iSCSI access enables availability and scalability
for these clients, and the standardized iSCSI protocol also provides an
additional layer of security isolation between clients and the SUSE Enterprise Storage
6 cluster. The configuration facility is named ceph-iscsi
. Using
ceph-iscsi
, Ceph storage administrators can define thin-provisioned,
replicated, highly-available volumes supporting read-only snapshots,
read-write clones, and automatic resizing with Ceph RADOS Block Device
(RBD). Administrators can then export volumes either via a single ceph-iscsi
gateway host, or via multiple gateway hosts supporting multipath failover.
Linux, Microsoft Windows, and VMware hosts can connect to volumes using the iSCSI
protocol, which makes them available like any other SCSI block device. This
means SUSE Enterprise Storage 6 customers can effectively run a complete
block-storage infrastructure subsystem on Ceph that provides all the
features and benefits of a conventional SAN, enabling future growth.
This chapter introduces detailed information to set up a Ceph cluster infrastructure together with an iSCSI gateway so that the client hosts can use remotely stored data as local storage devices using the iSCSI protocol.
iSCSI is an implementation of the Small Computer System Interface (SCSI) command set using the Internet Protocol (IP), specified in RFC 3720. iSCSI is implemented as a service where a client (the initiator) talks to a server (the target) via a session on TCP port 3260. An iSCSI target's IP address and port are called an iSCSI portal, where a target can be exposed through one or more portals. The combination of a target and one or more portals is called the target portal group (TPG).
The underlying data link layer protocol for iSCSI is commonly Ethernet. More specifically, modern iSCSI infrastructures use 10 Gigabit Ethernet or faster networks for optimal throughput. 10 Gigabit Ethernet connectivity between the iSCSI gateway and the back-end Ceph cluster is strongly recommended.
The Linux kernel iSCSI target was originally named LIO for linux-iscsi.org, the project's original domain and Web site. For some time, no fewer than four competing iSCSI target implementations were available for the Linux platform, but LIO ultimately prevailed as the single iSCSI reference target. The mainline kernel code for LIO uses the simple, but somewhat ambiguous name "target", distinguishing between "target core" and a variety of front-end and back-end target modules.
The most commonly used front-end module is arguably iSCSI. However, LIO also supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE) and several other front-end protocols. At this time, only the iSCSI protocol is supported by SUSE Enterprise Storage.
The most frequently used target back-end module is one that is capable of simply re-exporting any available block device on the target host. This module is named iblock. However, LIO also has an RBD-specific back-end module supporting parallelized multipath I/O access to RBD images.
This section introduces brief information on iSCSI initiators used on Linux, Microsoft Windows, and VMware platforms.
The standard initiator for the Linux platform is
open-iscsi
. open-iscsi
launches a daemon, iscsid
, which the user can
then use to discover iSCSI targets on any given portal, log in to targets,
and map iSCSI volumes. iscsid
communicates with
the SCSI mid layer to create in-kernel block devices that the kernel can
then treat like any other SCSI block device on the system. The
open-iscsi
initiator can be deployed in
conjunction with the Device Mapper Multipath
(dm-multipath
) facility to provide a highly
available iSCSI block device.
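For illustration, a typical open-iscsi session against an iSCSI Gateway portal looks like the following sketch; the portal address is an example:

# discover the targets exposed by the gateway portal
root # iscsiadm -m discovery -t sendtargets -p 192.168.124.104
# log in to the discovered target
root # iscsiadm -m node -p 192.168.124.104 --login
# the new SCSI block device then shows up in the device list
root # lsblk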
The default iSCSI initiator for the Microsoft Windows operating system is the Microsoft iSCSI initiator. The iSCSI service can be configured via a graphical user interface (GUI), and supports multipath I/O for high availability.
The default iSCSI initiator for VMware vSphere and ESX is the VMware
ESX software iSCSI initiator, vmkiscsi
. When
enabled, it can be configured either from the vSphere client, or using the
vmkiscsi-tool
command. You can then format storage
volumes connected through the vSphere iSCSI storage adapter with VMFS, and
use them like any other VM storage device. The VMware initiator also
supports multipath I/O for high availability.
ceph-iscsi
ceph-iscsi
combines the benefits of RADOS Block Devices with the ubiquitous
versatility of iSCSI. By employing ceph-iscsi
on an iSCSI target host (known
as the iSCSI Gateway), any application that needs to make use of block storage can
benefit from Ceph, even if it does not speak any Ceph client protocol.
Instead, users can use iSCSI or any other target front-end protocol to
connect to an LIO target, which translates all target I/O to RBD storage
operations.
ceph-iscsi
is inherently highly-available and supports multipath operations.
Thus, downstream initiator hosts can use multiple iSCSI gateways for both
high availability and scalability. When communicating with an iSCSI
configuration with more than one gateway, initiators may load-balance iSCSI
requests across multiple gateways. In the event of a gateway failing, being
temporarily unreachable, or being disabled for maintenance, I/O will
transparently continue via another gateway.
A minimum configuration of SUSE Enterprise Storage 6 with ceph-iscsi
consists of the following components:
A Ceph storage cluster. The Ceph cluster consists of a minimum of four physical servers hosting at least eight object storage daemons (OSDs) each. In such a configuration, three OSD nodes also double as a monitor (MON) host.
An iSCSI target server running the LIO iSCSI target, configured via
ceph-iscsi
.
An iSCSI initiator host, running open-iscsi
(Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible
iSCSI initiator implementation.
A recommended production configuration of SUSE Enterprise Storage 6 with
ceph-iscsi
consists of:
A Ceph storage cluster. A production Ceph cluster consists of any number of (typically more than 10) OSD nodes, each typically running 10-12 object storage daemons (OSDs), with no fewer than three dedicated MON hosts.
Several iSCSI target servers running the LIO iSCSI target, configured via
ceph-iscsi
. For iSCSI fail-over and load-balancing, these servers must run
a kernel supporting the target_core_rbd
module.
Update packages are available from the SUSE Linux Enterprise Server maintenance channel.
Any number of iSCSI initiator hosts, running
open-iscsi
(Linux), the Microsoft iSCSI Initiator
(Microsoft Windows), or any other compatible iSCSI initiator implementation.
This section describes steps to install and configure an iSCSI Gateway on top of SUSE Enterprise Storage.
You can deploy the iSCSI Gateway either during Ceph cluster deployment process, or add it to an existing cluster using DeepSea.
To include the iSCSI Gateway during the cluster deployment process, refer to Section 4.5.1.2, “Role Assignment”.
To add the iSCSI Gateway to an existing cluster, refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.2 “Adding New Roles to Nodes”.
RBD images are created in the Ceph store and subsequently exported to
iSCSI. We recommend that you use a dedicated RADOS pool for this purpose.
You can create a volume from any host that is able to connect to your
storage cluster using the Ceph rbd
command line
utility. This requires the client to have at least a minimal ceph.conf
configuration file, and appropriate CephX authentication credentials.
To create a new volume for subsequent export via iSCSI, use the
rbd create
command, specifying the volume size in
megabytes. For example, in order to create a 100 GB volume named 'testvol'
in the pool named 'iscsi-images', run:
cephadm >
rbd --pool iscsi-images create --size=102400 'testvol'
To export RBD images via iSCSI, you can use either Ceph Dashboard Web
interface or the ceph-iscsi
gwcli utility. In this section we will focus
on gwcli only, demonstrating how to create an iSCSI target that exports
an RBD image using the command line.
Only the following RBD image features are supported:
layering
, striping (v2)
,
exclusive-lock
, fast-diff
and
data-pool
. RBD images with any other feature enabled
cannot be exported.
As root
, start the iSCSI gateway command line interface:
root #
gwcli
Go to iscsi-targets
and create a target with the name
iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol
:
gwcli > /> cd /iscsi-targets
gwcli > /iscsi-targets> create iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol
Create the iSCSI gateways by specifying the gateway name
and ip
address:
gwcli > /iscsi-targets> cd iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/gateways
gwcli > /iscsi-target...tvol/gateways> create iscsi1 192.168.124.104
gwcli > /iscsi-target...tvol/gateways> create iscsi2 192.168.124.105
Use the help
command to show the list of available
commands in the current configuration node.
Add the RBD image with the name 'testvol' in the pool 'iscsi-images':
gwcli > /iscsi-target...tvol/gateways> cd /disks
gwcli > /disks> attach iscsi-images/testvol
Map the RBD image to the target:
gwcli > /disks> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/disks
gwcli > /iscsi-target...testvol/disks> add iscsi-images/testvol
You can use lower level tools, such as targetcli
, to
query the local configuration, but not to modify it.
You can use the ls
command to review the configuration.
Some configuration nodes also support the info
command
that can be used to display more detailed information.
Note that ACL authentication is enabled by default, so this target is not accessible yet. Check Section 9.4.4, “Authentication and Access Control” for more information about authentication and access control.
iSCSI authentication is flexible and covers many authentication possibilities.
'No authentication' means that any initiator will be able to access any LUNs on the corresponding target. You can enable 'No authentication' by disabling the ACL authentication:
gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/hosts
gwcli > /iscsi-target...testvol/hosts> auth disable_acl
When using initiator name based ACL authentication, only the defined initiators are allowed to connect. You can define an initiator by doing:
gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/hosts
gwcli > /iscsi-target...testvol/hosts> create iqn.1996-04.de.suse:01:e6ca28cc9f20
Defined initiators will be able to connect, but will only have access to the RBD images that were explicitly added to the initiator:
gwcli >
/iscsi-target...:e6ca28cc9f20> disk add rbd/testvol
In addition to the ACL, you can enable the CHAP authentication by specifying a user name and password for each initiator:
gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/hosts/iqn.1996-04.de.suse:01:e6ca28cc9f20
gwcli > /iscsi-target...:e6ca28cc9f20> auth username=common12 password=pass12345678
User names must have a length of 8 to 64 characters and can only contain letters, '.', '@', '-', '_' or ':'.
Passwords must have a length of 12 to 16 characters and can only contain letters, '@', '-', '_' or '/'.
Optionally, you can also enable the CHAP mutual authentication by
specifying the mutual_username
and
mutual_password
parameters in the auth
command.
Discovery authentication is independent of the previous authentication methods. It requires credentials for browsing, it is optional, and can be configured by:
gwcli > /> cd /iscsi-targets
gwcli > /iscsi-targets> discovery_auth username=du123456 password=dp1234567890
User names must have a length of 8 to 64 characters and can only contain letters, '.', '@', '-', '_' or ':'.
Passwords must have a length of 12 to 16 characters and can only contain letters, '@', '-', '_' or '/'.
Optionally, you can also specify the mutual_username
and
mutual_password
parameters in the
discovery_auth
command.
Discovery authentication can be disabled by using the following command:
gwcli >
/iscsi-targets> discovery_auth nochap
ceph-iscsi
can be configured with advanced parameters which are subsequently
passed on to the LIO I/O target. The parameters are divided up into
'target' and 'disk' parameters.
Unless otherwise noted, changing these parameters from the default setting is not recommended.
You can view the value of these settings by using the
info
command:
gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol
gwcli > /iscsi-target...i.x86:testvol> info
And change a setting using the reconfigure
command:
gwcli >
/iscsi-target...i.x86:testvol> reconfigure login_timeout 20
The available 'target' settings are:
default_cmdsn_depth: Default CmdSN (Command Sequence Number) depth. Limits the number of requests that an iSCSI initiator can have outstanding at any moment.
default_erl: Default error recovery level.
login_timeout: Login timeout value in seconds.
netif_timeout: NIC failure timeout in seconds.
prod_mode_write_protect: If set to 1, prevents writes to LUNs.
You can view the value of these settings by using the
info
command:
gwcli > /> cd /disks/rbd/testvol
gwcli > /disks/rbd/testvol> info
And change a setting using the reconfigure
command:
gwcli >
/disks/rbd/testvol> reconfigure rbd/testvol emulate_pr 0
The available 'disk' settings are:
Block size of the underlying device.
If set to 1, enables Third Party Copy.
If set to 1, enables Compare and Write.
If set to 1, turns on Disable Page Out.
If set to 1, enables Force Unit Access read.
If set to 1, enables Force Unit Access write.
If set to 1, uses the back-end device name for the model alias.
If set to 0, support for SCSI Reservations, including Persistent Group Reservations, is disabled. While disabled, the SES iSCSI Gateway can ignore reservation state, resulting in improved request latency.
Setting backstore_emulate_pr to 0 is recommended if iSCSI initiators do not require SCSI Reservation support.
If set to 0, the Queue Algorithm Modifier has Restricted Reordering.
If set to 1, enables Task Aborted Status.
If set to 1, enables Thin Provisioning Unmap.
If set to 1, enables Thin Provisioning Write Same.
If set to 1, enables Unit Attention Interlock.
If set to 1, turns on Write Cache Enable.
If set to 1, enforces persistent reservation ISIDs.
If set to 1, the backstore is a non-rotational device.
Maximum number of block descriptors for UNMAP.
Maximum number of LBAs for UNMAP.
Maximum length for WRITE_SAME.
Optimal request size in sectors.
DIF protection type.
Queue depth.
UNMAP granularity.
UNMAP granularity alignment.
When enabled, LIO will always write out the persistent
reservation state to persistent storage, regardless of
whether or not the client has requested it via
aptpl=1
. This has no effect with the kernel RBD
back-end for LIO—it always persists PR state. Ideally, the
target_core_rbd
option should force it to '1' and
throw an error if someone tries to disable it via configfs.
Affects whether LIO will advertise LBPRZ to SCSI initiators, indicating that zeros will be read back from a region following UNMAP or WRITE SAME with an unmap bit.
tcmu-runner
ceph-iscsi supports both rbd (kernel-based) and user:rbd (tcmu-runner) backstores, making all management transparent and independent of the backstore.
tcmu-runner
based iSCSI Gateway deployments are currently
a technology preview.
Unlike kernel-based iSCSI Gateway deployments, tcmu-runner
based iSCSI Gateways do not offer support for multipath I/O or SCSI Persistent
Reservations.
To export a RADOS Block Device image using tcmu-runner
, all you
need to do is specify the user:rbd
backstore when attaching
the disk:
gwcli >
/disks> attach rbd/testvol backstore=user:rbd
When using tcmu-runner
, the exported RBD image
must have the exclusive-lock
feature enabled.
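If the image was created without that feature, you can enable it before attaching the disk, for example:

cephadm > rbd feature enable iscsi-images/testvol exclusive-lock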
The Ceph file system (CephFS) is a POSIX-compliant file system that uses
a Ceph storage cluster to store its data. CephFS uses the same cluster
system as Ceph block devices, Ceph object storage with its S3 and Swift
APIs, or native bindings (librados
).
To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.
With SUSE Enterprise Storage 6, SUSE introduces official support for many scenarios in which the scale-out and distributed component CephFS is used. This entry describes hard limits and provides guidance for the suggested use cases.
A supported CephFS deployment must meet these requirements:
A minimum of one Metadata Server. SUSE recommends deploying several nodes with the MDS role. Only one will be active; the rest will be passive.
Remember to list all the MON nodes in the mount command when mounting the CephFS from a client (see the mount sketch after this list).
Clients are SUSE Linux Enterprise Server 12 SP3 or newer, or SUSE Linux Enterprise Server 15 or newer, using the
cephfs
kernel module driver. The FUSE module is not
supported.
CephFS quotas are supported in SUSE Enterprise Storage 6 and can
be set on any subdirectory of the Ceph file system. The quota restricts
either the number of bytes
or files
stored beneath the specified point in the directory hierarchy. For more
information, see Book “Administration Guide”, Chapter 17 “Clustered File System”, Section 17.6 “Setting CephFS Quotas”.
CephFS supports file layout changes as documented in
Section 10.3.4, “File Layouts”. However, while the file system is
mounted by any client, new data pools may not be added to an existing
CephFS file system (ceph mds add_data_pool
). They may
only be added while the file system is unmounted.
A minimum of one Metadata Server. We recommend deploying several
nodes with the MDS role. By default additional MDS daemons start
as standby
daemons, acting as backups for the active MDS.
Multiple active MDS daemons are also supported
(refer to Section 10.3.2, “MDS Cluster Size”).
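As mentioned in the requirements above, list all the MON nodes when mounting CephFS from a client. A minimal sketch using the cephfs kernel driver follows; the MON host names, the secret file, and the mount point are examples:

root # mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret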
Ceph metadata server (MDS) stores metadata for the CephFS. Ceph block
devices and Ceph object storage do not use MDS. MDSs
make it possible for POSIX file system users to execute basic
commands—such as ls
or
find
—without placing an enormous burden on the
Ceph storage cluster.
You can deploy MDS either during the initial cluster deployment process as described in Section 4.3, “Cluster Deployment”, or add it to an already deployed cluster as described in Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.1 “Adding New Cluster Nodes”.
After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall setting of the server where MDS is deployed: start yast, navigate to the firewall's allowed services configuration, and add Ceph OSD/MDS to the list of allowed services. If the Ceph MDS node is not allowed full traffic, mounting of a file system fails, even though other operations may work properly.
You can fine-tune the MDS behavior by inserting relevant options in the
ceph.conf
configuration file.
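For example, a minimal sketch of such an entry in ceph.conf; the 2 GB cache limit is only an illustration:

[mds]
mds cache memory limit = 2147483648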
If set to 'true' (default), monitors force standby-replay to be active.
Set under [mon]
or [global]
sections.
mds cache memory limit
The soft memory limit (in bytes) that the MDS will enforce for its
cache. Administrators should use this instead of the old mds
cache size
setting. Defaults to 1GB.
mds cache reservation
The cache reservation (memory or inodes) for the MDS cache to maintain. When the MDS begins touching its reservation, it will recall client state until its cache size shrinks to restore the reservation. Defaults to 0.05.
The number of inodes to cache. A value of 0 (default) indicates an
unlimited number. It is recommended to use mds cache memory
limit
to limit the amount of memory the MDS cache uses.
The insertion point for new items in the cache LRU (from the top). Default is 0.7
The fraction of directory that is dirty before Ceph commits using a full update instead of partial update. Default is 0.5
The maximum size of a directory update before Ceph breaks it into smaller transactions. Default is 90 MB.
The half-life of MDS cache temperature. Default is 5.
The frequency in seconds of beacon messages sent to the monitor. Default is 4.
The interval without beacons before Ceph declares an MDS laggy and possibly replace it. Default is 15.
The blacklist duration for failed MDSs in the OSD map. This setting
controls how long failed MDS daemons will stay in the OSD map blacklist.
It has no effect on how long something is blacklisted when the
administrator blacklists it manually. For example, the ceph osd
blacklist add
command will still use the default blacklist
time. Default is 24 * 60.
The interval in seconds to wait for clients to reconnect during MDS restart. Default is 45.
How frequently the MDS performs internal periodic tasks. Default is 5.
The minimum interval in seconds to try to avoid propagating recursive stats up the tree. Default is 1.
How quickly dirstat changes propagate up. Default is 5.
The number of inode numbers to preallocate per client session. Default is 1000.
Determines whether the MDS should allow clients to see request results before they commit to the journal. Default is 'true'.
Use trivial map for directory updates. Default is 'true'.
The function to use for hashing files across directory fragments. Default is 2 (that is 'rjenkins').
Determines whether the MDS should try to skip corrupt journal events during journal replay. Default is 'false'.
The maximum events in the journal before we initiate trimming. Set to -1 (default) to disable limits.
The maximum number of segments (objects) in the journal before we initiate trimming. Set to -1 to disable limits. Default is 30.
The maximum number of segments to expire in parallels. Default is 20.
The maximum number of inodes in an EOpen event. Default is 100.
Determines how frequently to sample directory temperature for fragmentation decisions. Default is 3.
The maximum temperature before Ceph attempts to replicate metadata to other nodes. Default is 8000.
The minimum temperature before Ceph stops replicating metadata to other nodes. Default is 0.
The maximum directory size before the MDS will split a directory fragment into smaller bits. Default is 10000.
The maximum directory read temperature before Ceph splits a directory fragment. Default is 25000.
The maximum directory write temperature before Ceph splits a directory fragment. Default is 10000.
The number of bits by which to split a directory fragment. Default is 3.
The minimum directory size before Ceph tries to merge adjacent directory fragments. Default is 50.
The frequency in seconds of workload exchanges between MDSs. Default is 10.
The delay in seconds between when a fragment becomes eligible to split or merge and when the fragmentation change is executed. Default is 5.
The ratio by which fragments may exceed the split size before a split is executed immediately, skipping the fragment interval. Default is 1.5
The maximum size of a fragment before any new entries are rejected with ENOSPC. Default is 100000.
The minimum temperature before Ceph migrates a subtree back to its parent. Default is 0.
The method for calculating MDS load:
0 = Hybrid.
1 = Request rate and latency.
2 = CPU load.
Default is 0.
The minimum subtree temperature before Ceph migrates. Default is 0.1
The minimum subtree temperature before Ceph searches a subtree. Default is 0.2
The minimum fraction of target subtree size to accept. Default is 0.8
The maximum fraction of target subtree size to accept. Default is 1.2
Ceph will migrate any subtree that is larger than this fraction of the target subtree size. Default is 0.3
Ceph will ignore any subtree that is smaller than this fraction of the target subtree size. Default is 0.001
The minimum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 5.
The maximum number of balancer iteration before Ceph removes an old MDS target from the MDS map. Default is 10.
The journal poll interval when in standby-replay mode ('hot standby'). Default is 1.
The interval for polling the cache during MDS shutdown. Default is 0.
Ceph will randomly fragment or merge directories. Default is 0.
Ceph will dump the MDS cache contents to a file on each MDS map. Default is 'false'.
Ceph will dump MDS cache contents to a file after rejoining the cache during recovery. Default is 'false'.
An MDS daemon will standby for another MDS daemon of the name specified in this setting.
An MDS daemon will standby for an MDS daemon of this rank. Default is -1.
Determines whether a Ceph MDS daemon should poll and replay the log of an active MDS ('hot standby'). Default is 'false'.
Set the minimum number of capabilities a client may hold. Default is 100.
Set the maximum ratio of current caps that may be recalled during MDS cache pressure. Default is 0.8
How frequently to update the journal head object. Default is 15.
How many stripe periods to read-ahead on journal replay. Default is 10.
How many stripe periods to zero ahead of write position. Default 10.
Maximum additional latency in seconds we incur artificially. Default is 0.001
Maximum bytes we will delay flushing. Default is 0.
When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you can create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.
A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:
Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.
Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.
When assigning a role-mds
in the
policy.cfg
, the required pools are automatically
created. You can manually create the pools cephfs_data
and cephfs_metadata
for manual performance tuning before
setting up the Metadata Server. DeepSea will not create these pools if they already
exist.
For more information on managing pools, see Book “Administration Guide”, Chapter 9 “Managing Storage Pools”.
To create the two required pools—for example, 'cephfs_data' and 'cephfs_metadata'—with default settings for use with CephFS, run the following commands:
cephadm > ceph osd pool create cephfs_data pg_num
cephadm > ceph osd pool create cephfs_metadata pg_num
It is possible to use EC pools instead of replicated pools. We recommend using EC pools only for low performance requirements and infrequent random access, for example cold storage, backups, or archiving. CephFS on EC pools
requires BlueStore to be enabled and the pool must have the
allow_ec_overwrite
option set. This option can be set by
running ceph osd pool set ec_pool allow_ec_overwrites
true
.
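For example, a sketch of creating an additional erasure-coded data pool prepared for CephFS; the pool name and placement group counts are illustrative:

cephadm > ceph osd pool create cephfs_data_ec 64 64 erasure
cephadm > ceph osd pool set cephfs_data_ec allow_ec_overwrites true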
Erasure coding adds significant overhead to file system operations, especially small updates. This overhead is inherent to using erasure coding as a fault tolerance mechanism. This penalty is the trade off for significantly reduced storage space overhead.
When the pools are created, you may enable the file system with the
ceph fs new
command:
cephadm >
ceph fs new fs_name metadata_pool_name data_pool_name
For example:
cephadm >
ceph fs new cephfs cephfs_metadata cephfs_data
You can check that the file system was created by listing all available CephFSs:
cephadm > ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
When the file system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:
cephadm > ceph mds stat
e5: 1/1/1 up
You can find more information of specific tasks—for example mounting, unmounting, and advanced CephFS setup—in Book “Administration Guide”, Chapter 17 “Clustered File System”.
A CephFS instance can be served by multiple active MDS daemons. All active MDS daemons that are assigned to a CephFS instance will distribute the file system's directory tree between themselves, and thus spread the load of concurrent clients. In order to add an active MDS daemon to a CephFS instance, a spare standby is needed. Either start an additional daemon or use an existing standby instance.
The following command will display the current number of active and passive MDS daemons.
cephadm >
ceph mds stat
The following command sets the number of active MDSs to two in a file system instance.
cephadm >
ceph fs set fs_name max_mds 2
In order to shrink the MDS cluster prior to an update, two steps are
necessary. First set max_mds
so that only one instance
remains:
cephadm >
ceph fs set fs_name max_mds 1
and after that explicitly deactivate the other active MDS daemons:
cephadm >
ceph mds deactivate fs_name:rank
where rank is the number of an active MDS daemon
of a file system instance, ranging from 0 to max_mds
-1.
We recommend at least one MDS is left as a standby daemon.
During Ceph updates, the feature flags on a file system instance may change (usually by adding new features). Incompatible daemons (such as the older versions) are not able to function with an incompatible feature set and will refuse to start. This means that updating and restarting one daemon can cause all other not yet updated daemons to stop and refuse to start. For this reason, we recommend shrinking the active MDS cluster to size one and stopping all standby daemons before updating Ceph. The manual steps for this update procedure are as follows:
Update the Ceph related packages using zypper
.
Shrink the active MDS cluster as described above to 1 instance and stop
all standby MDS daemons using their systemd
units on all other nodes:
root #
systemctl stop ceph-mds\*.service ceph-mds.target
Only then restart the single remaining MDS daemon, causing it to restart using the updated binary.
root #
systemctl restart ceph-mds\*.service ceph-mds.target
Restart all other MDS daemons and re-set the desired
max_mds
setting.
root #
systemctl start ceph-mds.target
If you use DeepSea, it will follow this procedure in case the ceph package was updated during Stages 0 and 4. It is possible to perform this procedure while clients have the CephFS instance mounted and I/O is ongoing. Note however that there will be a very brief I/O pause while the active MDS restarts. Clients will recover automatically.
It is good practice to reduce the I/O load as much as possible before updating an MDS cluster. An idle MDS cluster will go through this update procedure quicker. Conversely, on a heavily loaded cluster with multiple MDS daemons it is essential to reduce the load in advance to prevent a single MDS daemon from being overwhelmed by ongoing I/O.
The layout of a file controls how its contents are mapped to Ceph RADOS objects. You can read and write a file's layout using virtual extended attributes, or xattrs for short.
The name of the layout xattrs depends on whether a file is a regular file
or a directory. Regular files’ layout xattrs are called
ceph.file.layout
, while directories’ layout xattrs are
called ceph.dir.layout
. Where examples refer to
ceph.file.layout
, substitute the
.dir.
part as appropriate when dealing with directories.
The following attribute fields are recognized:
ID or name of a RADOS pool in which a file’s data objects will be stored.
RADOS namespace within a data pool to which the objects will be written. It is empty by default, meaning the default namespace.
The size in bytes of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete—it represents the data at the end of the file as well as the unused 'space' beyond it up to the end of the fixed stripe unit size.
The number of consecutive stripe units that constitute a RAID 0 'stripe' of file data.
The size in bytes of RADOS objects into which the file data is chunked.
RADOS enforces a configurable limit on object sizes. If you increase
CephFS object sizes beyond that limit then writes may not succeed.
The OSD setting is osd_max_object_size
, which is
128MB by default. Very large RADOS objects may prevent smooth
operation of the cluster, so increasing the object size limit past the
default is not recommended.
getfattr
Use the getfattr
command to read the layout information
of an example file file
as a single string:
root # touch file
root # getfattr -n ceph.file.layout file
# file: file
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"
Read individual layout fields:
root # getfattr -n ceph.file.layout.pool file
# file: file
ceph.file.layout.pool="cephfs_data"
root # getfattr -n ceph.file.layout.stripe_unit file
# file: file
ceph.file.layout.stripe_unit="4194304"
When reading layouts, the pool will usually be indicated by name. However, in rare cases when pools have only just been created, the ID may be output instead.
Directories do not have an explicit layout until it is customized. Attempts to read the layout will fail if it has never been modified: this indicates that the layout of the next ancestor directory with an explicit layout will be used.
root # mkdir dir
root # getfattr -n ceph.dir.layout dir
dir: ceph.dir.layout: No such attribute
root # setfattr -n ceph.dir.layout.stripe_count -v 2 dir
root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
setfattr
Use the setfattr
command to modify the layout fields of
an example file file
:
root # ceph osd lspools
0 rbd
1 cephfs_data
2 cephfs_metadata
root # setfattr -n ceph.file.layout.stripe_unit -v 1048576 file
root # setfattr -n ceph.file.layout.stripe_count -v 8 file
# Setting pool by ID:
root # setfattr -n ceph.file.layout.pool -v 1 file
# Setting pool by name:
root # setfattr -n ceph.file.layout.pool -v cephfs_data file
When the layout fields of a file are modified using setfattr, the file needs to be empty, otherwise an error will occur.
If you want to remove an explicit layout from an example directory
mydir
and revert back to inheriting the layout of its
ancestor, run the following:
root #
setfattr -x ceph.dir.layout mydir
Similarly, if you have set the 'pool_namespace' attribute and wish to modify the layout to use the default namespace instead, run:
# Create a directory and set a namespace on it
root # mkdir mydir
root # setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a pool_namespace=foons"

# Clear the namespace from the directory's layout
root # setfattr -x ceph.dir.layout.pool_namespace mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a"
Files inherit the layout of their parent directory at creation time. However, subsequent changes to the parent directory’s layout do not affect children:
root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# file1 inherits its parent's layout
root # touch dir/file1
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# update the layout of the directory before creating a second file
root # setfattr -n ceph.dir.layout.stripe_count -v 4 dir
root # touch dir/file2

# file1's layout is unchanged
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# ...while file2 has the parent directory's new layout
root # getfattr -n ceph.file.layout dir/file2
# file: dir/file2
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"
Files created as descendants of the directory also inherit its layout if the intermediate directories do not have layouts set:
root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"
root # mkdir dir/childdir
root # getfattr -n ceph.dir.layout dir/childdir
dir/childdir: ceph.dir.layout: No such attribute
root # touch dir/childdir/grandchild
root # getfattr -n ceph.file.layout dir/childdir/grandchild
# file: dir/childdir/grandchild
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"
Before you can use a pool with CephFS, you need to add it to the Metadata Server:
root # ceph fs add_data_pool cephfs cephfs_data_ssd
root # ceph fs ls   # Pool should now show up
.... data pools: [cephfs_data cephfs_data_ssd ]
Make sure that your cephx keys allow the client to access this new pool.
You can then update the layout on a directory in CephFS to use the pool you added:
root # mkdir /mnt/cephfs/myssddir
root # setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/myssddir
All new files created within that directory will now inherit its layout and place their data in your newly added pool. You may notice that the number of objects in your primary data pool continues to increase, even if files are being created in the newly added pool. This is normal: the file data is stored in the pool specified by the layout, but a small amount of metadata is kept in the primary data pool for all files.
NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. In SUSE Enterprise Storage 6, NFS versions 3 and 4 are supported. NFS Ganesha runs in the user space instead of the kernel space and directly interacts with the Object Gateway or CephFS.
Native CephFS and NFS clients are not restricted by file locks obtained via Samba, and vice-versa. Applications that rely on cross protocol file locking may experience data corruption if CephFS backed Samba share paths are accessed via other means.
To successfully deploy NFS Ganesha, you need to add a
role-ganesha
to your
/srv/pillar/ceph/proposals/policy.cfg
. For details,
see Section 4.5.1, “The policy.cfg
File”. NFS Ganesha also needs either a
role-rgw
or a role-mds
present in the
policy.cfg
.
Although it is possible to install and run the NFS Ganesha server on an already existing Ceph node, we recommend running it on a dedicated host with access to the Ceph cluster. The client hosts are typically not part of the cluster, but they need to have network access to the NFS Ganesha server.
To enable the NFS Ganesha server at any point after the initial installation,
add the role-ganesha
to the
policy.cfg
and re-run at least DeepSea stages 2 and
4. For details, see Section 4.3, “Cluster Deployment”.
NFS Ganesha is configured via the file
/etc/ganesha/ganesha.conf
that exists on the NFS Ganesha
node. However, this file is overwritten each time DeepSea stage 4 is
executed. Therefore, we recommend editing the template used by Salt, which
is the file
/srv/salt/ceph/ganesha/files/ganesha.conf.j2
on the
Salt master. For details about the configuration file, see
Book “Administration Guide”, Chapter 19 “NFS Ganesha: Export Ceph Data via NFS”, Section 19.2 “Configuration”.
The following requirements need to be met before DeepSea stages 2 and 4 can be executed to install NFS Ganesha:
At least one node needs to be assigned the
role-ganesha
.
You can define only one role-ganesha
per minion.
NFS Ganesha needs either an Object Gateway or CephFS to work.
The kernel based NFS needs to be disabled on minions with the
role-ganesha
role.
This procedure provides an example installation that uses both the Object Gateway and CephFS File System Abstraction Layers (FSAL) of NFS Ganesha.
If you have not done so, execute DeepSea stages 0 and 1 before continuing with this procedure.
root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1
After having executed stage 1 of DeepSea, edit the
/srv/pillar/ceph/proposals/policy.cfg
and add the
line
role-ganesha/cluster/NODENAME
Replace NODENAME with the name of a node in your cluster.
Also make sure that a role-mds
and a
role-rgw
are assigned.
Execute at least stages 2 and 4 of DeepSea. Running stage 3 in between is recommended.
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3   # optional but recommended
root@master # salt-run state.orch ceph.stage.4
Verify that NFS Ganesha is working by checking that NFS Ganesha service is running on the minion node:
root # salt -I roles:ganesha service.status nfs-ganesha
MINION_ID:
    True
This section provides an example of how to set up a two-node active-passive
configuration of NFS Ganesha servers. The setup requires the SUSE Linux Enterprise High Availability Extension. The
two nodes are called earth
and mars
.
Services that have their own fault tolerance and their own load balancing should not be running on cluster nodes that get fenced for failover services. Therefore, do not run Ceph Monitor, Metadata Server, iSCSI, or Ceph OSD services on High Availability setups.
For details about SUSE Linux Enterprise High Availability Extension, see https://www.suse.com/documentation/sle-ha-15/.
In this setup earth
has the IP address
192.168.1.1
and mars
has
the address 192.168.1.2
.
Additionally, two floating virtual IP addresses are used, allowing clients
to connect to the service independent of which physical node it is running
on. 192.168.1.10
is used for
cluster administration with Hawk2 and
192.168.2.1
is used exclusively
for the NFS exports. This makes it easier to apply security restrictions
later.
The following procedure describes the example installation. More details can be found at https://www.suse.com/documentation/sle-ha-15/book_sleha_quickstarts/data/art_sleha_install_quick.html.
Prepare the NFS Ganesha nodes on the Salt master:
Run DeepSea stages 0 and 1.
root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1
Assign the nodes earth
and mars
the
role-ganesha
in the
/srv/pillar/ceph/proposals/policy.cfg
:
role-ganesha/cluster/earth*.sls
role-ganesha/cluster/mars*.sls
Run DeepSea stages 2 to 4.
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4
Register the SUSE Linux Enterprise High Availability Extension on earth
and mars
.
root #
SUSEConnect
-r ACTIVATION_CODE -e E_MAIL
Install ha-cluster-bootstrap on both nodes:
root #
zypper
in ha-cluster-bootstrap
Initialize the cluster on earth
:
root@earth #
ha-cluster-init
Let mars
join the cluster:
root@mars #
ha-cluster-join
-c earth
Check the status of the cluster. You should see two nodes added to the cluster:
root@earth #
crm
status
On both nodes, disable the automatic start of the NFS Ganesha service at boot time:
root #
systemctl
disable nfs-ganesha
Start the crm
shell on earth
:
root@earth #
crm
configure
The next commands are executed in the crm shell.
On earth
, run the crm shell to execute the following commands to
configure the resource for NFS Ganesha daemons as clone of systemd resource
type:
crm(live)configure# primitive nfs-ganesha-server systemd:nfs-ganesha \
  op monitor interval=30s
crm(live)configure# clone nfs-ganesha-clone nfs-ganesha-server meta interleave=true
crm(live)configure# commit
crm(live)configure# status
    2 nodes configured
    2 resources configured
    Online: [ earth mars ]
    Full list of resources:
         Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
             Started: [ earth mars ]
Create a primitive IPAddr2 with the crm shell:
crm(live)configure# primitive ganesha-ip IPaddr2 \
  params ip=192.168.2.1 cidr_netmask=24 nic=eth0 \
  op monitor interval=10 timeout=20
crm(live)# status
    Online: [ earth mars ]
    Full list of resources:
     Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
         Started: [ earth mars ]
     ganesha-ip    (ocf::heartbeat:IPaddr2):    Started earth
To set up a relationship between the NFS Ganesha server and the floating Virtual IP, we use collocation and ordering.
crm(live)configure# colocation ganesha-ip-with-nfs-ganesha-server inf: ganesha-ip nfs-ganesha-clone
crm(live)configure# order ganesha-ip-after-nfs-ganesha-server Mandatory: nfs-ganesha-clone ganesha-ip
Use the mount
command from the client to ensure that
cluster setup is complete:
root #
mount
-t nfs -v -o sync,nfsvers=4 192.168.2.1:/ /mnt
In the event of an NFS Ganesha failure at one of the node, for example
earth
, fix the issue and clean up the resource. Only after the
resource is cleaned up can the resource fail back to earth
in case
NFS Ganesha fails at mars
.
To clean up the resource:
root@earth # crm resource cleanup nfs-ganesha-clone earth
root@earth # crm resource cleanup ganesha-ip earth
It may happen that the server is unable to reach the client because of a network issue. A ping resource can detect and mitigate this problem. Configuring this resource is optional.
Define the ping resource:
crm(live)configure#
primitive ganesha-ping ocf:pacemaker:ping \
params name=ping dampen=3s multiplier=100 host_list="CLIENT1 CLIENT2" \
op monitor interval=60 timeout=60 \
op start interval=0 timeout=60 \
op stop interval=0 timeout=60
host_list is a list of IP addresses separated by space characters. The IP
addresses will be pinged regularly to check for network outages. If a client
must always have access to the NFS server, add it to host_list.
Create a clone:
crm(live)configure#
clone ganesha-ping-clone ganesha-ping \
meta interleave=true
The following command creates a constraint for the NFS Ganesha service. It
forces the service to move to another node when
host_list
is unreachable.
crm(live)configure# location nfs-ganesha-server-with-ganesha-ping \
        nfs-ganesha-clone \
        rule -inf: not_defined ping or ping lte 0
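As with the resources created earlier, the ping primitive, its clone, and the location constraint only take effect after you commit them from the crm shell:
crm(live)configure# commit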
DeepSea does not support configuring NFS Ganesha HA. To prevent DeepSea from failing after NFS Ganesha HA was configured, exclude starting and stopping the NFS Ganesha service from DeepSea Stage 4:
Copy /srv/salt/ceph/ganesha/default.sls to /srv/salt/ceph/ganesha/ha.sls.
Remove the .service entry from /srv/salt/ceph/ganesha/ha.sls so that it looks as follows:
include:
- .keyring
- .install
- .configure
Add the following line to /srv/pillar/ceph/stack/global.yml:
ganesha_init: ha
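The following is a minimal command sketch of these three steps, assuming you edit ha.sls by hand to delete the .service entry and that global.yml does not already contain a ganesha_init key:
root@master # cp /srv/salt/ceph/ganesha/default.sls /srv/salt/ceph/ganesha/ha.sls
root@master # vi /srv/salt/ceph/ganesha/ha.sls     # delete the line containing .service
root@master # echo 'ganesha_init: ha' >> /srv/pillar/ceph/stack/global.yml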
This section provides an example of a simple active-active NFS Ganesha setup. The aim is to deploy two NFS Ganesha servers layered on top of the same existing CephFS. The servers will be two Ceph cluster nodes with separate addresses. The clients need to be distributed between them manually. “Failover” in this configuration means manually unmounting and remounting the other server on the client.
For our example configuration, you need the following:
A running Ceph cluster. See Section 4.3, “Cluster Deployment” for details on deploying and configuring a Ceph cluster by using DeepSea.
At least one configured CephFS. See Chapter 10, Installation of CephFS for more details on deploying and configuring CephFS.
Two Ceph cluster nodes with NFS Ganesha deployed. See Chapter 11, Installation of NFS Ganesha for more details on deploying NFS Ganesha.
Although NFS Ganesha nodes can share resources with other Ceph related services, we recommend using dedicated servers to improve performance.
After you deploy the NFS Ganesha nodes, verify that the cluster is operational and the default CephFS pools are there:
cephadm > rados lspools
cephfs_data
cephfs_metadata
Check that both NFS Ganesha nodes have the file
/etc/ganesha/ganesha.conf installed. The
nfs-ganesha-ceph package ships with a sample
/etc/ganesha/ceph.conf file that you can tweak as
needed. The following is an example of ceph.conf:
NFS_CORE_PARAM
{
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}

NFSv4
{
    RecoveryBackend = rados_cluster;
    Minor_Versions = 1,2;
}

CACHEINODE
{
    Dir_Chunk = 0;
    NParts = 1;
    Cache_Size = 1;
}

EXPORT
{
    Export_ID=100;
    Protocols = 4;
    Transports = TCP;
    Path = /;
    Pseudo = /ceph/;
    Access_Type = RW;
    Attr_Expiration_Time = 0;
    FSAL {
        Name = CEPH;
    }
}

RADOS_KV
{
    pool = "cephfs_metadata";
    namespace = "ganesha";
    #nodeid = "a";
}
Because legacy versions of NFS prevent us from lifting the grace period early and therefore prolong a server restart, we disable options for NFS prior to version 4.2. We also disable most of the NFS Ganesha caching as Ceph libraries do aggressive caching already.
The 'rados_cluster' recovery back-end stores its info in RADOS objects. Although it is not a lot of data, we want it highly available. We use the CephFS metadata pool for this purpose, and declare a new 'ganesha' namespace in it to keep it distinct from CephFS objects.
Most of the configuration is identical between the two hosts; however, the
nodeid option in the 'RADOS_KV' block needs to be a unique string for each
node. By default, NFS Ganesha sets nodeid to the host name of the node.
If you need to use fixed values other than the host names, you can for example
set nodeid = 'a' on one node and nodeid = 'b' on the other.
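As an illustrative check (hypothetical value shown; run it on each node after editing the configuration), you could confirm the setting with grep:
root # grep nodeid /etc/ganesha/ganesha.conf
    nodeid = "a";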
We need to verify that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. NFS Ganesha uses this object to communicate the current state with regard to a grace period.
The nfs-ganesha-rados-grace package contains a command line tool for querying and manipulating this database. If the package is not installed on at least one of the nodes, install it with:
root # zypper install nfs-ganesha-rados-grace
We will use this command to create the grace database and add both nodeids. In
our example, the two NFS Ganesha nodes are named ses6min1.example.com and
ses6min2.example.com. On one of the NFS Ganesha hosts, run:
cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min1.example.com
cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min2.example.com
cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=1 rec=0
======================================================
ses6min1.example.com     E
ses6min2.example.com     E
This creates the grace database and adds both 'ses6min1.example.com' and 'ses6min2.example.com' to it. The last command dumps the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the 'E' flag set. The 'cur' and 'rec' values show the current and recovery epochs, which is how we keep track of what hosts are allowed to perform recovery and when.
On both NFS Ganesha nodes, restart the related services:
root # systemctl restart nfs-ganesha.service
After the services are restarted, check the grace database:
cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=3 rec=0
======================================================
ses6min1.example.com
ses6min2.example.com
Note that both nodes have cleared their 'E' flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.
After you complete all the preceding steps, you can mount the exported NFS from either of the two NFS Ganesha servers, and perform normal NFS operations against them.
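For example, a client could mount one export from each server. This is only a sketch using the example host names; it assumes the mount points already exist, and the options should be adjusted to your environment:
root # mount -t nfs -o nfsvers=4.1 ses6min1.example.com:/ /mnt/ganesha1
root # mount -t nfs -o nfsvers=4.1 ses6min2.example.com:/ /mnt/ganesha2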
Our example configuration assumes that if one of the two NFS Ganesha servers goes down, you will restart it manually within 5 minutes. After 5 minutes, the Metadata Server may cancel the session that the NFS Ganesha client held and all of the state associated with it. If the session’s capabilities get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.
More information can be found in Book “Administration Guide”, Chapter 19 “NFS Ganesha: Export Ceph Data via NFS”.
This chapter describes how to deploy containerized SUSE Enterprise Storage 6 on top of a SUSE CaaS Platform 4 Kubernetes cluster.
Running a containerized Ceph cluster on SUSE CaaS Platform is a technology preview. Do not deploy on a production Kubernetes cluster.
Before you start deploying, consider the following points:
To run Ceph in Kubernetes, SUSE Enterprise Storage 6 uses an upstream project called Rook (https://rook.io/).
Depending on the configuration, Rook may consume all unused disks on all nodes in a Kubernetes cluster.
The setup requires privileged containers.
Before you start deploying, you need to have:
A running SUSE CaaS Platform 4 cluster.
SUSE CaaS Platform 4 worker nodes with a number of extra disks attached as storage for the Ceph cluster.
The Rook orchestrator uses configuration files in YAML format called manifests. The manifests you need are included in the rook-k8s-yaml RPM package. Install it by running:
root # zypper install rook-k8s-yaml
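To see where the package placed the manifest files, you can list its contents with a generic RPM query (not specific to Rook):
root # rpm -ql rook-k8s-yaml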
Rook-Ceph includes two main components: the 'operator' which is run by Kubernetes and allows creation of Ceph clusters, and the Ceph 'cluster' itself which is created and partially managed by the operator.
The manifests used in this setup install all Rook and Ceph components in the 'rook-ceph' namespace. If you need to change it, adapt all references to the namespace in the Kubernetes manifests accordingly.
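One rough way to rename the namespace in the manifest files discussed below is a search and replace, shown here only as a sketch ('my-rook-namespace' is a hypothetical name); review the result afterwards, because a blind replace may also touch unrelated strings:
root # sed -i 's/rook-ceph/my-rook-namespace/g' common.yaml operator.yaml cluster.yaml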
Depending on which features of Rook you intend to use, alter the 'Pod Security
Policy' configuration in common.yaml to limit Rook's security requirements.
Follow the comments in the manifest file.
The manifest operator.yaml configures the Rook operator. Normally, you do not
need to change it. Find more information in the comments in the manifest file.
The manifest cluster.yaml is responsible for configuring the actual Ceph
cluster which will run in Kubernetes. Find a detailed description of all
available options in the upstream Rook documentation at
https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html.
By default, Rook is configured to use all nodes that are not tainted with
node-role.kubernetes.io/master:NoSchedule and will obey configured placement
settings (see
https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html#placement-configuration-settings).
The following example disables this behavior and uses only the nodes explicitly
listed in the nodes section:
storage:
  useAllNodes: false
  nodes:
  - name: caasp4-worker-0
  - name: caasp4-worker-1
  - name: caasp4-worker-2
By default, Rook is configured to use all free and empty disks on each node as Ceph storage.
The Rook-Ceph upstream documentation at https://rook.github.io/docs/rook/master/ceph-storage.html contains more detailed information about configuring more advanced deployments. Use it as a reference for understanding the basics of Rook-Ceph before doing more advanced configurations.
Find more details about the SUSE CaaS Platform product at https://www.suse.com/documentation/suse-caasp.
Install the Rook-Ceph common components, CSI roles, and the Rook-Ceph operator by executing the following command on the SUSE CaaS Platform master node:
root # kubectl apply -f common.yaml -f operator.yaml
common.yaml
will create the 'rook-ceph' namespace,
Ceph Custom Resource Definitions (CRDs) (see
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
to make Kubernetes aware of Ceph Objects (for example 'CephCluster'), and the
RBAC roles and Pod Security Policies (see
https://kubernetes.io/docs/concepts/policy/pod-security-policy/)
which are necessary for allowing Rook to manage the cluster-specific
resources.
hostNetwork and hostPorts Usage
Allowing the usage of hostNetwork is required when using hostNetwork: true in
the Cluster Resource Definition. Allowing the usage of hostPorts in the
PodSecurityPolicy is also required.
Verify the installation by running kubectl get pods -n rook-ceph on the SUSE CaaS Platform master node, for example:
root # kubectl get pods -n rook-ceph
NAME READY STATUS RESTARTS AGE
rook-ceph-agent-57c9j 1/1 Running 0 22h
rook-ceph-agent-b9j4x 1/1 Running 0 22h
rook-ceph-operator-cf6fb96-lhbj7 1/1 Running 0 22h
rook-discover-mb8gv 1/1 Running 0 22h
rook-discover-tztz4 1/1 Running 0 22h
After you modify cluster.yaml to your needs, you can create the Ceph cluster.
Run the following command on the SUSE CaaS Platform master node:
root # kubectl apply -f cluster.yaml
Watch the 'rook-ceph' namespace to see the Ceph cluster being created. You will
see as many Ceph Monitors as configured in the cluster.yaml manifest (the
default is 3), one Ceph Manager, and as many Ceph OSDs as you have free disks.
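One way to watch the progress from the master node (a sketch; the exact resources present depend on your cluster.yaml) is to follow the pods and query the CephCluster custom resource:
root # kubectl get pods -n rook-ceph --watch
root # kubectl get cephcluster -n rook-ceph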
While bootstrapping the Ceph cluster, you will see some pods with the name
rook-ceph-osd-prepare-NODE-NAME run for a while and then terminate with the
status 'Completed'. As their name implies, these pods provision Ceph OSDs. They
are left without being deleted so that you can inspect their logs after their
termination. For example:
root # kubectl get pods --namespace rook-ceph
NAME READY STATUS RESTARTS AGE
rook-ceph-agent-57c9j 1/1 Running 0 22h
rook-ceph-agent-b9j4x 1/1 Running 0 22h
rook-ceph-mgr-a-6d48564b84-k7dft 1/1 Running 0 22h
rook-ceph-mon-a-cc44b479-5qvdb 1/1 Running 0 22h
rook-ceph-mon-b-6c6565ff48-gm9wz 1/1 Running 0 22h
rook-ceph-operator-cf6fb96-lhbj7 1/1 Running 0 22h
rook-ceph-osd-0-57bf997cbd-4wspg 1/1 Running 0 22h
rook-ceph-osd-1-54cf468bf8-z8jhp 1/1 Running 0 22h
rook-ceph-osd-prepare-caasp4-worker-0-f2tmw 0/2 Completed 0 9m35s
rook-ceph-osd-prepare-caasp4-worker-1-qsfhz 0/2 Completed 0 9m33s
rook-ceph-tools-76c7d559b6-64rkw 1/1 Running 0 22h
rook-discover-mb8gv 1/1 Running 0 22h
rook-discover-tztz4 1/1 Running 0 22h
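To inspect the logs of one of the provisioning pods, you can use kubectl logs with a pod name taken from the listing above, for example (the --all-containers flag prints the logs of every container in the pod; alternatively, select a single container with -c):
root # kubectl logs -n rook-ceph rook-ceph-osd-prepare-caasp4-worker-0-f2tmw --all-containers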
Rook allows you to use three different types of storage:
Object storage exposes an S3 API to the storage cluster for applications to put and get data. Refer to https://rook.io/docs/rook/v1.0/ceph-object.html for a detailed description.
A shared file system can be mounted with read/write permission from multiple pods. This is useful for applications that are clustered using a shared file system. Refer to https://rook.io/docs/rook/v1.0/ceph-filesystem.html for a detailed description.
Block storage allows you to mount storage to a single pod. Refer to https://rook.io/docs/rook/v1.0/ceph-block.html for a detailed description.
To uninstall Rook, follow these steps:
Delete any Kubernetes applications which are consuming Rook storage.
Delete all object, file, and/or block storage artifacts that you created by following Section 12.5, “Using Rook as Storage for Kubernetes Workload”.
Delete the Ceph cluster, operator, and related resources:
root # kubectl delete -f cluster.yaml
root # kubectl delete -f operator.yaml
root # kubectl delete -f common.yaml
Delete the data on hosts:
root # rm -rf /var/lib/rook
If necessary, wipe the disks that were used by Rook. Refer to https://rook.io/docs/rook/master/ceph-teardown.html for more details.
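A common way to wipe a disk that Rook used is sketched below. This is illustrative only and destroys all data on the device, so double-check the device name (shown here as the placeholder /dev/sdX) before running it:
root # sgdisk --zap-all /dev/sdX
root # wipefs --all /dev/sdX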
Several key packages in SUSE Enterprise Storage 6 are based on the Nautilus release series of Ceph. When the Ceph project (https://github.com/ceph/ceph) publishes new point releases in the Nautilus series, SUSE Enterprise Storage 6 is updated to ensure that the product benefits from the latest upstream bugfixes and feature backports.
This chapter contains summaries of notable changes contained in each upstream point release that has been—or is planned to be—included in the product.
This point release fixes a serious regression that found its way into the 14.2.3 point release. This regression did not affect SUSE Enterprise Storage customers because we did not ship a version based on 14.2.3.
Fixed a denial of service vulnerability where an unauthenticated client of Ceph Object Gateway could trigger a crash from an uncaught exception.
Nautilus-based librbd clients can now open images on Jewel clusters.
The Object Gateway num_rados_handles option has been removed. If you were
using a value of num_rados_handles greater than 1, multiply your current
objecter_inflight_ops and objecter_inflight_op_bytes parameters by the old
num_rados_handles value to get the same throttle behavior.
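As a hypothetical illustration, if you previously ran with num_rados_handles = 4 and the Ceph defaults of objecter_inflight_ops = 1024 and objecter_inflight_op_bytes = 104857600 (100 MiB), the equivalent settings would be:
# In the [client.rgw.RGW_NAME] section of ceph.conf (example values only):
objecter_inflight_ops = 4096             # 1024 * 4
objecter_inflight_op_bytes = 419430400   # 104857600 * 4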
The secure mode of Messenger v2 protocol is no longer experimental with this release. This mode is now the preferred mode of connection for monitors.
osd_deep_scrub_large_omap_object_key_threshold has been lowered to detect an
object with a large number of omap keys more easily.
The Ceph Dashboard now supports silencing Prometheus notifications.
The no{up,down,in,out} related commands have been revamped. There are now two
ways to set the no{up,down,in,out} flags: the old command
ceph osd [un]set FLAG, which sets cluster-wide flags, and the new command
ceph osd [un]set-group FLAGS WHO, which sets flags in batch at the granularity
of any CRUSH node or device class.
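For example (an illustrative sketch; the OSD IDs and node name are placeholders):
cephadm > ceph osd set noout                           # cluster-wide flag (old style)
cephadm > ceph osd set-group noout,noin osd.0 osd.1    # flags only for specific OSDs
cephadm > ceph osd unset-group noout node1             # clear the flag for the CRUSH node 'node1'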
radosgw-admin introduces two subcommands that allow managing expire-stale
objects that might be left behind after a bucket reshard in earlier versions of
Object Gateway. One subcommand lists such objects and the other deletes them.
Earlier Nautilus releases (14.2.1 and 14.2.0) have an issue where deploying a
single new Nautilus BlueStore OSD on an upgraded cluster (i.e. one that was
originally deployed pre-Nautilus) breaks the pool utilization statistics
reported by ceph df. Until all OSDs have been reprovisioned or updated (via
ceph-bluestore-tool repair), the pool statistics will show values that are
lower than the true value. This is resolved in 14.2.2, such that the cluster
only switches to using the more accurate per-pool stats after all OSDs are
14.2.2 or later, are Block Storage, and have been updated via the repair
function if they were created prior to Nautilus.
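If you want an OSD created before Nautilus to contribute to the per-pool statistics before it is reprovisioned, the repair can be run while the OSD is stopped, roughly as follows (the OSD ID and data path are example values):
root # systemctl stop ceph-osd@0
root # ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
root # systemctl start ceph-osd@0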
The default value for mon_crush_min_required_version has been changed from
firefly to hammer, which means the cluster will issue a health warning if your
CRUSH tunables are older than Hammer. There is generally a small (but non-zero)
amount of data that will move around by making the switch to Hammer tunables.
If possible, we recommend that you set the oldest allowed client to hammer or
later. To display what the current oldest allowed client is, run:
cephadm > ceph osd dump | grep min_compat_client
If the current value is older than hammer, run the following command to
determine whether it is safe to make this change by verifying that there are no
clients older than Hammer currently connected to the cluster:
cephadm > ceph features
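If the output of ceph features shows no clients older than Hammer, you can then raise the oldest allowed client, for example:
cephadm > ceph osd set-require-min-compat-client hammer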
The newer straw2 CRUSH bucket type was introduced in Hammer. If you verify that
all clients are Hammer or newer, you can use new features that are only
supported for straw2 buckets, including the crush-compat mode for the Balancer
(Book “Administration Guide”, Chapter 8 “Ceph Manager Modules”, Section 8.1 “Balancer”).
Find detailed information about the patch at https://download.suse.com/Download?buildid=D38A7mekBz4~
This was the first point release following the original Nautilus release (14.2.0). The original ('General Availability' or 'GA') version of SUSE Enterprise Storage 6 was based on this point release.
The node from which you run the ceph-deploy
utility to
deploy Ceph on OSD nodes.
A point which aggregates other nodes into a hierarchy of physical locations.
S3 buckets or containers are different terms meaning the same thing: folders for storing objects.
Controlled Replication Under Scalable Hashing: An algorithm that determines how to store and retrieve data by computing data storage locations. CRUSH requires a map of the cluster to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.
A cluster node that maintains maps of cluster state, including the monitor map and the OSD map.
Any single machine or server in a Ceph cluster.
Depending on context, Object Storage Device or Object Storage Daemon. The
ceph-osd daemon is the component of Ceph that is responsible for storing
objects on a local file system and providing access to them over the network.
A cluster node that stores data, handles data replication, recovery, backfilling, rebalancing, and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.
Placement Group: a sub-division of a pool, used for performance tuning.
Logical partitions for storing objects such as disk images.
Rules to determine data placement for a pool.
The core set of storage software which stores the user’s data. Such a set consists of Ceph monitors and OSDs.
AKA “Ceph Object Store”.
This chapter lists content changes for this document since the release of the latest maintenance update of SUSE Enterprise Storage 5. You can find changes related to the cluster deployment that apply to previous versions in https://www.suse.com/documentation/suse-enterprise-storage-5/book_storage_deployment/data/ap_deploy_docupdate.html.
The document was updated on the following dates:
Added Book “Administration Guide”, Chapter 13 “Improving Performance with LVM cache” (jsc#SES-269).
Added Chapter 12, SUSE Enterprise Storage 6 on top of SUSE CaaS Platform 4 Kubernetes Cluster (jsc#SES-720).
Made the upgrade chapter sequential, Chapter 5, Upgrading from Previous Releases (https://bugzilla.suse.com/show_bug.cgi?id=1144709).
Added changelog entry for Ceph 14.2.4 (https://bugzilla.suse.com/show_bug.cgi?id=1151881).
Unified the pool name 'cephfs_metadata' in examples in Chapter 11, Installation of NFS Ganesha (https://bugzilla.suse.com/show_bug.cgi?id=1148548).
Updated Section 4.5.2.1, “Specification” to include more realistic values (https://bugzilla.suse.com/show_bug.cgi?id=1148216).
Added two new repositories for 'Module-Desktop' in Section 5.8.1, “Manual Node Upgrade using the Installer DVD”, because our customers mostly use the GUI (https://bugzilla.suse.com/show_bug.cgi?id=1144897).
deepsea-cli is not a dependency of deepsea in Section 4.4, “DeepSea CLI” (https://bugzilla.suse.com/show_bug.cgi?id=1143602).
Added a hint to migrate ntpd to chronyd in Section 5.1, “Points to Consider before the Upgrade” (https://bugzilla.suse.com/show_bug.cgi?id=1135185).
Added Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.15 “Deactivating Tuned Profiles” (https://bugzilla.suse.com/show_bug.cgi?id=1130430).
Consider migrating a whole OSD node in Section 5.16.3, “OSD Deployment” (https://bugzilla.suse.com/show_bug.cgi?id=1138691).
Added a point about migrating MDS names in Section 5.1, “Points to Consider before the Upgrade” (https://bugzilla.suse.com/show_bug.cgi?id=1138804).
Added Section 4.5.2, “DriveGroups” (jsc#SES-548).
Rewrote Chapter 5, Upgrading from Previous Releases (jsc#SES-88).
Added Section 6.2.1, “Enabling IPv6 for Ceph Cluster Deployment” (jsc#SES-409).
Made Block Storage the default storage back-end (Fate#325658).
Removed all references to external online documentation, replaced with the relevant content (Fate#320121).
Added information about AppArmor during upgrade in Section 5.1, “Points to Consider before the Upgrade” (https://bugzilla.suse.com/show_bug.cgi?id=1137945).
Added information about CTDB clusters not supporting online upgrade in Section 5.1, “Points to Consider before the Upgrade” (https://bugzilla.suse.com/show_bug.cgi?id=1129108).
Added information on co-location of Ceph services on High Availability setups in Section 11.3, “High Availability Active-Passive Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1136871).
Added a tip about orphaned packages in Section 5.8, “Per Node Upgrade—Basic Procedure” (https://bugzilla.suse.com/show_bug.cgi?id=1136624).
Updated profile-* with role-storage in Tip: Deploying Monitor Nodes without Defining OSD Profiles (https://bugzilla.suse.com/show_bug.cgi?id=1138181).
Added Section 5.16, “Migration from Profile-based Deployments to DriveGroups” (https://bugzilla.suse.com/show_bug.cgi?id=1135340).
Added Section 5.11, “Upgrade Metadata Servers” (https://bugzilla.suse.com/show_bug.cgi?id=1135064).
MDS cluster needs to be shrunk in Section 5.1, “Points to Consider before the Upgrade” (https://bugzilla.suse.com/show_bug.cgi?id=1134826).
Changed configuration file to /srv/pillar/ceph/stack/global.yml (https://bugzilla.suse.com/show_bug.cgi?id=1129191).
Updated various parts of Book “Administration Guide”, Chapter 18 “Exporting Ceph Data via Samba” (https://bugzilla.suse.com/show_bug.cgi?id=1101478).
master_minion.sls is gone in Section 4.3, “Cluster Deployment” (https://bugzilla.suse.com/show_bug.cgi?id=1090921).
Mentioned the deepsea-cli package in Section 4.4, “DeepSea CLI” (https://bugzilla.suse.com/show_bug.cgi?id=1087454).