SUSE Enterprise Storage 6

Deployment Guide

Authors: Tomáš Bažant, Jana Haláčková, and Sven Seeberg
Publication Date: 12/03/2019
About This Guide
Available Documentation
Feedback
Documentation Conventions
About the Making of This Manual
Ceph Contributors
I SUSE Enterprise Storage
1 SUSE Enterprise Storage 6 and Ceph
1.1 Ceph Features
1.2 Core Components
1.3 Storage Structure
1.4 BlueStore
1.5 Additional Information
2 Hardware Requirements and Recommendations
2.1 Multiple Architecture Configurations
2.2 Minimum Cluster Configuration
2.3 Object Storage Nodes
2.4 Monitor Nodes
2.5 Object Gateway Nodes
2.6 Metadata Server Nodes
2.7 Salt Master
2.8 iSCSI Nodes
2.9 Network Recommendations
2.10 Naming Limitations
2.11 OSD and Monitor Sharing One Server
2.12 Recommended Production Cluster Configuration
2.13 SUSE Enterprise Storage 6 and Other SUSE Products
3 Admin Node HA Setup
3.1 Outline of the HA Cluster for Admin Node
3.2 Building HA Cluster with Admin Node
II Cluster Deployment and Upgrade
4 Deploying with DeepSea/Salt
4.1 Read the Release Notes
4.2 Introduction to DeepSea
4.3 Cluster Deployment
4.4 DeepSea CLI
4.5 Configuration and Customization
5 Upgrading from Previous Releases
5.1 Points to Consider before the Upgrade
5.2 Backup Cluster Data
5.3 Migrate from ntpd to chronyd
5.4 Patch Cluster Prior to Upgrade
5.5 Verify the Current Environment
5.6 Check the Cluster's State
5.7 Offline Upgrade of CTDB Clusters
5.8 Per Node Upgrade—Basic Procedure
5.9 Upgrade the Admin Node
5.10 Upgrade Ceph Monitor/Ceph Manager Nodes
5.11 Upgrade Metadata Servers
5.12 Upgrade Ceph OSDs
5.13 OSD Migration to BlueStore
5.14 Upgrade Application Nodes
5.15 Update policy.cfg and Deploy Ceph Dashboard using DeepSea
5.16 Migration from Profile-based Deployments to DriveGroups
6 Customizing the Default Configuration
6.1 Using Customized Configuration Files
6.2 Modifying Discovered Configuration
III Installation of Additional Services
7 Installation of Services to Access your Data
8 Ceph Object Gateway
8.1 Object Gateway Manual Installation
9 Installation of iSCSI Gateway
9.1 iSCSI Block Storage
9.2 General Information about ceph-iscsi
9.3 Deployment Considerations
9.4 Installation and Configuration
9.5 Exporting RADOS Block Device Images using tcmu-runner
10 Installation of CephFS
10.1 Supported CephFS Scenarios and Guidance
10.2 Ceph Metadata Server
10.3 CephFS
11 Installation of NFS Ganesha
11.1 Preparation
11.2 Example Installation
11.3 High Availability Active-Passive Configuration
11.4 Active-Active Configuration
11.5 More Information
IV Cluster Deployment on top of SUSE CaaS Platform 4 (Technology Preview)
12 SUSE Enterprise Storage 6 on top of SUSE CaaS Platform 4 Kubernetes Cluster
12.1 Considerations
12.2 Prerequisites
12.3 Get Rook Manifests
12.4 Installation
12.5 Using Rook as Storage for Kubernetes Workload
12.6 Uninstalling Rook
A Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases
Glossary
B Documentation Updates
B.1 2019 (Maintenance update of SUSE Enterprise Storage 6 documentation)
B.2 June, 2019 (Release of SUSE Enterprise Storage 6)

Copyright © 2019 SUSE LLC

Copyright © 2016, Red Hat, Inc., and contributors.

The text of and illustrations in this document are licensed under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. All other trademarks are the property of their respective owners.

For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.

About This Guide

SUSE Enterprise Storage 6 is an extension to SUSE Linux Enterprise Server 15 SP1. It combines the capabilities of the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage 6 provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.

This guide helps you understand the concepts of SUSE Enterprise Storage 6, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.

Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.

For an overview of the documentation available for your product and the latest documentation updates, refer to http://www.suse.com/documentation.

1 Available Documentation

The following manuals are available for this product:

Book “Administration Guide”

The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt, Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.

Deployment Guide

Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Find the latest documentation updates at http://www.suse.com/documentation where you can download the manuals for your product in multiple formats.

2 Feedback

Several feedback channels are available:

Bugs and Enhancement Requests

For services and support options available for your product, refer to http://www.suse.com/support/.

To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support › Service Request.

User Comments

We want to hear your comments and suggestions for this manual and the other documentation included with this product. Use the User Comments feature at the bottom of each page in the online documentation or go to http://www.suse.com/documentation/feedback.html and enter your comments there.

Mail

For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

  • /etc/passwd: directory names and file names

  • placeholder: replace placeholder with the actual value

  • PATH: the environment variable PATH

  • ls, --help: commands, options, and parameters

  • user: users or groups

  • Alt, Alt+F1: a key to press or a key combination; keys are shown in uppercase as on a keyboard

  • File, File › Save As: menu items, buttons

  • Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.

5 Ceph Contributors

The Ceph project and its documentation are the result of the work of hundreds of contributors and organizations. See https://ceph.com/contributors/ for more details.

Part I SUSE Enterprise Storage

1 SUSE Enterprise Storage 6 and Ceph

SUSE Enterprise Storage 6 is a distributed storage system designed for scalability, reliability and performance which is based on the Ceph technology. A Ceph cluster can be run on commodity servers in a common network like Ethernet. The cluster scales up well to thousands of servers (later on referred to as nodes) and into the petabyte range. As opposed to conventional systems which have allocation tables to store and fetch data, Ceph uses a deterministic algorithm to allocate storage for data and has no centralized information structure. Ceph assumes that in storage clusters the addition or removal of hardware is the rule, not the exception. The Ceph cluster automates management tasks such as data distribution and redistribution, data replication, failure detection and recovery. Ceph is both self-healing and self-managing which results in a reduction of administrative and budget overhead.

This chapter provides a high level overview of SUSE Enterprise Storage 6 and briefly describes the most important components.

Tip
Tip

Since SUSE Enterprise Storage 5, the only cluster deployment method is DeepSea. Refer to Chapter 4, Deploying with DeepSea/Salt for details about the deployment process.

1.1 Ceph Features

The Ceph environment has the following features:

Scalability

Ceph can scale to thousands of nodes and manage storage in the range of petabytes.

Commodity Hardware

No special hardware is required to run a Ceph cluster. For details, see Chapter 2, Hardware Requirements and Recommendations.

Self-managing

The Ceph cluster is self-managing. When nodes are added, removed or fail, the cluster automatically redistributes the data. It is also aware of overloaded disks.

No Single Point of Failure

No node in a cluster stores important information alone. The number of redundancies can be configured.

Open Source Software

Ceph is an open source software solution and independent of specific hardware or vendors.

1.2 Core Components

To make full use of Ceph's power, it is necessary to understand some of the basic components and concepts. This section introduces some parts of Ceph that are often referenced in other chapters.

1.2.1 RADOS

The basic component of Ceph is called RADOS (Reliable Autonomic Distributed Object Store). It is responsible for managing the data stored in the cluster. Data in Ceph is usually stored as objects. Each object consists of an identifier and the data.
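
As a quick illustration of this object model, the rados command line utility can store and retrieve raw objects directly. The pool, object, and file names below are arbitrary examples:

cephadm > rados -p example-pool put example-object ./local-file
cephadm > rados -p example-pool ls
cephadm > rados -p example-pool get example-object ./restored-file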

RADOS provides the following access methods to the stored objects that cover many use cases:

Object Gateway

Object Gateway is an HTTP REST gateway for the RADOS object store. It enables direct access to objects stored in the Ceph cluster.

RADOS Block Device

RADOS Block Devices (RBD) can be accessed like any other block device. They can be used, for example, in combination with libvirt for virtualization purposes.

CephFS

The Ceph File System is a POSIX-compliant file system.

librados

librados is a library that can be used with many programming languages to create an application capable of directly interacting with the storage cluster.

librados is used by the Object Gateway and RBD, while CephFS interfaces directly with RADOS (see Figure 1.1, “Interfaces to the Ceph Object Store”).

Interfaces to the Ceph Object Store
Figure 1.1: Interfaces to the Ceph Object Store

1.2.2 CRUSH

At the core of a Ceph cluster is the CRUSH algorithm. CRUSH is the acronym for Controlled Replication Under Scalable Hashing. CRUSH is a function that handles the storage allocation and needs comparably few parameters. That means only a small amount of information is necessary to calculate the storage position of an object. The parameters are a current map of the cluster including the health state, some administrator-defined placement rules and the name of the object that needs to be stored or retrieved. With this information, all nodes in the Ceph cluster are able to calculate where an object and its replicas are stored. This makes writing or reading data very efficient. CRUSH tries to evenly distribute data over all nodes in the cluster.

The CRUSH map contains all storage nodes and administrator-defined placement rules for storing objects in the cluster. It defines a hierarchical structure that usually corresponds to the physical structure of the cluster. For example, the data-containing disks are in hosts, hosts are in racks, racks in rows and rows in data centers. This structure can be used to define failure domains. Ceph then ensures that replications are stored on different branches of a specific failure domain.

If the failure domain is set to rack, replications of objects are distributed over different racks. This can mitigate outages caused by a failed switch in a rack. If one power distribution unit supplies a row of racks, the failure domain can be set to row. When the power distribution unit fails, the replicated data is still available on other rows.
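
For example, a replicated CRUSH rule with the failure domain set to rack could be created and assigned to a pool as follows; the rule and pool names are examples only:

cephadm > ceph osd crush rule create-replicated rack-rule default rack
cephadm > ceph osd pool set example-pool crush_rule rack-rule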

1.2.3 Ceph Nodes and Daemons

In Ceph, nodes are servers working for the cluster. They can run several different types of daemons. It is recommended to run only one type of daemon on each node, except for Ceph Manager daemons which can be collocated with Ceph Monitors. Each cluster requires at least Ceph Monitor, Ceph Manager and Ceph OSD daemons:

Admin Node

Admin Node is a Ceph cluster node where the Salt master service is running. The Admin Node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services. It usually includes other services as well, for example the Ceph Dashboard Web UI with the Grafana dashboard backed by the Prometheus monitoring toolkit.

Ceph Monitor

Ceph Monitor (often abbreviated as MON) nodes maintain information about the cluster health state, a map of all nodes and data distribution rules (see Section 1.2.2, “CRUSH”).

If failures or conflicts occur, the Ceph Monitor nodes in the cluster decide by majority which information is correct. To form a qualified majority, it is recommended to have an odd number of Ceph Monitor nodes, and at least three of them.

If more than one site is used, the Ceph Monitor nodes should be distributed over an odd number of sites. The number of Ceph Monitor nodes per site should be such that more than 50% of the Ceph Monitor nodes remain functional if one site fails.

Ceph Manager

The Ceph manager (MGR) collects the state information from the whole cluster. The Ceph manager daemon runs alongside the monitor daemons. It provides additional monitoring and serves as an interface to external monitoring and management systems.

The Ceph manager requires no additional configuration, beyond ensuring it is running. You can deploy it as a separate role using DeepSea.

Ceph OSD

A Ceph OSD is a daemon handling Object Storage Devices (OSDs), which are physical or logical storage units such as hard disks, partitions, or logical volumes. The daemon additionally takes care of data replication and rebalancing when nodes are added or removed.

Ceph OSD daemons communicate with monitor daemons and provide them with the state of the other OSD daemons.

To use CephFS, Object Gateway, NFS Ganesha, or iSCSI Gateway, additional nodes are required:

Metadata Server (MDS)

The Metadata Servers store metadata for CephFS. By using an MDS, you can execute basic file system commands such as ls without overloading the cluster.

Object Gateway

The Object Gateway is an HTTP REST gateway for the RADOS object store. It is compatible with OpenStack Swift and Amazon S3 and has its own user management.

NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or CephFS. It runs in user space instead of kernel space and interacts directly with the Object Gateway or CephFS.

iSCSI Gateway

iSCSI is a storage network protocol that allows clients to send SCSI commands to SCSI storage devices (targets) on remote servers.

Samba Gateway

The Samba Gateway provides Samba (SMB) access to data stored on CephFS.

1.3 Storage Structure

1.3.1 Pool

Objects that are stored in a Ceph cluster are put into pools. Pools represent logical partitions of the cluster to the outside world. For each pool a set of rules can be defined, for example, how many replications of each object must exist. The standard configuration of pools is called replicated pool.

Pools usually contain objects but can also be configured to act similarly to RAID 5. In this configuration, objects are stored in chunks along with additional coding chunks. The coding chunks contain the redundant information. The number of data and coding chunks can be defined by the administrator. In this configuration, pools are referred to as erasure coded pools.
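
As a sketch of the two pool types, a replicated pool and an erasure coded pool could be created as follows; the pool names, placement group counts, and the k=2/m=1 erasure profile are example values only:

cephadm > ceph osd pool create example-replicated 64 64 replicated
cephadm > ceph osd erasure-code-profile set example-profile k=2 m=1
cephadm > ceph osd pool create example-ec 64 64 erasure example-profile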

1.3.2 Placement Group

Placement Groups (PGs) are used for the distribution of data within a pool. When creating a pool, a certain number of placement groups is set. The placement groups are used internally to group objects and are an important factor for the performance of a Ceph cluster. The PG for an object is determined by the object's name.
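
To see which placement group and which OSDs a particular object maps to, you can use the ceph osd map command; the pool and object names here are arbitrary examples:

cephadm > ceph osd map example-pool example-object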

1.3.3 Example

This section provides a simplified example of how Ceph manages data (see Figure 1.2, “Small Scale Ceph Example”). This example does not represent a recommended configuration for a Ceph cluster. The hardware setup consists of three storage nodes or Ceph OSDs (Host 1, Host 2, Host 3). Each node has three hard disks which are used as OSDs (osd.1 to osd.9). The Ceph Monitor nodes are omitted from this example.

Note
Note: Difference between Ceph OSD and OSD

While Ceph OSD or Ceph OSD daemon refers to a daemon that is run on a node, the word OSD refers to the logical disk that the daemon interacts with.

The cluster has two pools, Pool A and Pool B. While Pool A replicates objects only two times, resilience for Pool B is more important and it has three replications for each object.
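
In terms of commands, replication levels like those of Pool A and Pool B could be set as follows; the pool names are hypothetical:

cephadm > ceph osd pool set pool-a size 2
cephadm > ceph osd pool set pool-b size 3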

When an application puts an object into a pool, for example via the REST API, a Placement Group (PG1 to PG4) is selected based on the pool and the object name. The CRUSH algorithm then calculates on which OSDs the object is stored, based on the Placement Group that contains the object.

In this example the failure domain is set to host. This ensures that replications of objects are stored on different hosts. Depending on the replication level set for a pool, the object is stored on two or three OSDs that are used by the Placement Group.

An application that writes an object only interacts with one Ceph OSD, the primary Ceph OSD. The primary Ceph OSD takes care of replication and confirms the completion of the write process after all other OSDs have stored the object.

If osd.5 fails, all objects in PG1 are still available on osd.1. As soon as the cluster recognizes that an OSD has failed, another OSD takes over. In this example, osd.4 is used as a replacement for osd.5. The objects stored on osd.1 are then replicated to osd.4 to restore the replication level.

Small Scale Ceph Example
Figure 1.2: Small Scale Ceph Example

If a new node with new OSDs is added to the cluster, the cluster map changes. The CRUSH function then returns different locations for objects. Objects that receive new locations are relocated. This process results in a balanced usage of all OSDs.

1.4 BlueStore

BlueStore is the default storage back end for Ceph since SUSE Enterprise Storage 5. It has better performance than FileStore, full data check-summing, and built-in compression.

BlueStore manages either one, two, or three storage devices. In the simplest case, BlueStore consumes a single primary storage device. The storage device is normally partitioned into two parts:

  1. A small partition containing BlueFS, which implements the file system-like functionality required by RocksDB.

  2. A large partition occupying the rest of the device. It is managed directly by BlueStore and contains all of the actual data. This primary device is normally identified by a block symbolic link in the data directory.

It is also possible to deploy BlueStore across two additional devices:

A WAL device can be used for BlueStore’s internal journal or write-ahead log. It is identified by the block.wal symbolic link in the data directory. It is only useful to use a separate WAL device if the device is faster than the primary device or the DB device, for example when:

  • The WAL device is an NVMe, and the DB device is an SSD, and the data device is either SSD or HDD.

  • Both the WAL and DB devices are separate SSDs, and the data device is an SSD or HDD.

A DB device can be used for storing BlueStore’s internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. Again, it is only helpful to provision a shared DB device if it is faster than the primary device.
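
In SUSE Enterprise Storage 6 the OSD layout is normally defined through DriveGroups (see Section 4.5.2, “DriveGroups”), but as an illustration, a single BlueStore OSD with separate DB and WAL devices could be prepared manually with ceph-volume; the device paths are placeholders:

root # ceph-volume lvm prepare --bluestore --data /dev/sdb \
  --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2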

Tip
Tip: Plan for the DB Size

Plan thoroughly for a sufficient size of the DB device. If the DB device fills up, metadata spills over to the primary device, which badly degrades the OSD's performance.

You can check if a WAL/DB partition is getting full and spilling over with the ceph daemon osd.ID perf dump command. The slow_used_bytes value shows the amount of data being spilled out:

cephadm > ceph daemon osd.ID perf dump | jq '.bluefs'
"db_total_bytes": 1073741824,
"db_used_bytes": 33554432,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 554432,
"slow_used_bytes": 554432,

1.5 Additional Information

  • Ceph as a community project has its own extensive online documentation. For topics not found in this manual, refer to http://docs.ceph.com/docs/master/.

  • The original publication CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data by S.A. Weil, S.A. Brandt, E.L. Miller, C. Maltzahn provides helpful insight into the inner workings of Ceph. Especially when deploying large scale clusters, it is recommended reading. The publication can be found at http://www.ssrc.ucsc.edu/papers/weil-sc06.pdf.

2 Hardware Requirements and Recommendations

The hardware requirements of Ceph are heavily dependent on the IO workload. The following hardware requirements and recommendations should be considered as a starting point for detailed planning.

In general, the recommendations given in this section are on a per-process basis. If several processes are located on the same machine, the CPU, RAM, disk and network requirements need to be added up.

2.1 Multiple Architecture Configurations

SUSE Enterprise Storage supports both x86 and Arm architectures. When considering each architecture, it is important to note that from a cores per OSD, frequency, and RAM perspective, there is no real difference between CPU architectures for sizing.

Like with smaller x86 processors (non-server), lower-performance Arm-based cores may not provide an optimal experience, especially when used for erasure coded pools.

2.2 Minimum Cluster Configuration

  • At least four OSD nodes, with eight OSD disks each, are required.

  • Three Ceph Monitor nodes (an SSD is required for the dedicated OS drive).

  • iSCSI Gateways, Object Gateways, and Metadata Servers require an additional 4 GB of RAM and four cores.

  • Ceph Monitor, Object Gateway, and Metadata Server nodes require redundant deployment.

  • Separate Admin Node with 4 GB RAM, four cores, 1 TB capacity. This is typically the Salt master node. Ceph services and gateways, such as Ceph Monitor, Ceph Manager, Metadata Server, Ceph OSD, Object Gateway, or NFS Ganesha are not supported on the Admin Node.

2.3 Object Storage Nodes

2.3.1 Minimum Requirements

  • CPU recommendations:

    • 1x 2GHz CPU Thread per spinner

    • 2x 2GHz CPU Thread per SSD

    • 4x 2GHz CPU Thread per NVMe

  • Separate 10 GbE networks (public/client and back-end), required 4x 10 GbE, recommended 2x 25 GbE.

  • Total RAM required = number of OSDs x (1 GB + osd_memory_target) + 16 GB. A worked example follows this list.

    Refer to Book “Administration Guide”, Chapter 14 “Ceph Cluster Configuration”, Section 14.2.1 “Automatic Cache Sizing” for more details on osd_memory_target.

  • OSD disks in JBOD configurations or individual RAID-0 configurations.

  • OSD journal can reside on OSD disk.

  • OSD disks should be exclusively used by SUSE Enterprise Storage.

  • Dedicated disk/SSD for the operating system, preferably in a RAID 1 configuration.

  • If this OSD host will host part of a cache pool used for cache tiering, allocate at least an additional 4 GB of RAM.

  • Ceph Monitors, gateways, and Metadata Servers can reside on Object Storage Nodes.

  • OSD nodes should be bare metal, not virtualized, for disk performance reasons.
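
As a worked example of the RAM formula above, assume a node with 12 OSDs and the default osd_memory_target of 4 GB:

12 x (1 GB + 4 GB) + 16 GB = 76 GB of RAM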

2.3.2 Minimum Disk Size

There are two types of disk space needed to run an OSD: the space for the disk journal (for FileStore) or WAL/DB device (for BlueStore), and the primary space for the stored data. The minimum (and default) value for the journal/WAL/DB is 6 GB. The minimum space for data is 5 GB, as partitions smaller than 5 GB are automatically assigned a weight of 0.

So although the minimum disk space for an OSD is 11 GB, we do not recommend a disk smaller than 20 GB, even for testing purposes.

2.3.3 Recommended Size for the BlueStore's WAL and DB Device

Tip
Tip: More Information

Refer to Section 1.4, “BlueStore” for more information on BlueStore.

  • We recommend reserving 4 GB for the WAL device. The recommended DB size is 64 GB for most workloads.

  • If you intend to put the WAL and DB device on the same disk, then we recommend using a single partition for both devices, rather than having a separate partition for each. This allows Ceph to use the DB device for the WAL operation as well. Management of the disk space is therefore more effective as Ceph uses the DB partition for the WAL only if there is a need for it. Another advantage is that the probability that the WAL partition gets full is very small, and when it is not entirely used then its space is not wasted but used for DB operation.

    To share the DB device with the WAL, do not specify the WAL device, and specify only the DB device (see the sketch after this list).

    Find more information about specifying an OSD layout in Section 4.5.2, “DriveGroups”.
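
A minimal sketch of such a layout is a DriveGroups specification in /srv/salt/ceph/configuration/files/drive_groups.yml that places data on rotational disks and the DB (and implicitly the WAL) on solid-state disks. The group name and matching criteria are examples only:

drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0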

2.3.4 Using SSD for OSD Journals

Solid-state drives (SSD) have no moving parts. This reduces random access time and read latency while accelerating data throughput. Because their price per megabyte is significantly higher than the price of spinning hard disks, SSDs are only suitable for smaller storage.

OSDs may see a significant performance improvement by storing their journal on an SSD and the object data on a separate hard disk.

Tip
Tip: Sharing an SSD for Multiple Journals

As journal data occupies relatively little space, you can mount several journal directories to a single SSD disk. Keep in mind that with each shared journal, the performance of the SSD disk degrades. We do not recommend sharing more than six journals on the same SSD disk, or 12 on the same NVMe disk.

2.3.5 Maximum Recommended Number of Disks

You can have as many disks in one server as the server allows. There are a few things to consider when planning the number of disks per server:

  • Network bandwidth. The more disks you have in a server, the more data must be transferred via the network card(s) for the disk write operations.

  • Memory. RAM above 2 GB is used for the BlueStore cache. With the default osd_memory_target of 4 GB, the system has a reasonable starting cache size for spinning media. If using SSD or NVMe, consider increasing the cache size and RAM allocation per OSD to maximize performance.

  • Fault tolerance. If the complete server fails, the more disks it has, the more OSDs the cluster temporarily loses. Moreover, to keep the replication rules running, you need to copy all the data from the failed server among the other nodes in the cluster.

2.4 Monitor Nodes

  • At least three Ceph Monitor nodes are required. The number of monitors should always be odd (1+2n).

  • 4 GB of RAM.

  • Processor with four logical cores.

  • An SSD or other sufficiently fast storage type is highly recommended for monitors, specifically for the /var/lib/ceph path on each monitor node, as quorum may be unstable with high disk latencies. Two disks in a RAID 1 configuration are recommended for redundancy. It is recommended that separate disks or at least separate disk partitions are used for the monitor processes to protect the monitor's available disk space from things like log file creep.

  • There must only be one monitor process per node.

  • Mixing OSD, monitor, or Object Gateway nodes is only supported if sufficient hardware resources are available. That means that the requirements for all services need to be added up.

  • Two network interfaces bonded to multiple switches.

2.5 Object Gateway Nodes

Object Gateway nodes should have six to eight CPU cores and 32 GB of RAM (64 GB recommended). When other processes are co-located on the same machine, their requirements need to be added up.

2.6 Metadata Server Nodes

Proper sizing of the Metadata Server nodes depends on the specific use case. Generally, the more open files the Metadata Server is to handle, the more CPU and RAM it needs. The minimum requirements are:

  • 3 GB of RAM per Metadata Server daemon.

  • Bonded network interface.

  • 2.5 GHz CPU with at least 2 cores.

2.7 Salt Master

At least 4 GB of RAM and a quad-core CPU are required. This includes running the Ceph Dashboard on the Admin Node. For large clusters with hundreds of nodes, 6 GB of RAM is suggested.

2.8 iSCSI Nodes

iSCSI nodes should have six to eight CPU cores and 16 GB of RAM.

2.9 Network Recommendations

The network environment where you intend to run Ceph should ideally be a bonded set of at least two network interfaces that is logically split into a public part and a trusted internal part using VLANs. The 802.3ad bonding mode is recommended, if possible, to provide maximum bandwidth and resiliency.
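
A minimal sketch of such a bonded interface on SUSE Linux Enterprise Server, assuming two slave interfaces eth0 and eth1 and an example static address, could look like the following in /etc/sysconfig/network/ifcfg-bond0:

STARTMODE='auto'
BOOTPROTO='static'
IPADDR='192.168.100.10/24'
BONDING_MASTER='yes'
BONDING_SLAVE_0='eth0'
BONDING_SLAVE_1='eth1'
BONDING_MODULE_OPTS='mode=802.3ad miimon=100'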

The public VLAN serves to provide the service to the customers, while the internal part provides for the authenticated Ceph network communication. The main reason for this is that although Ceph provides authentication and protection against attacks once secret keys are in place, the messages used to configure these keys may be transferred openly and are vulnerable.

Tip
Tip: Nodes Configured via DHCP

If your storage nodes are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (running systemctl status ceph\* will result in "unable to bind" errors). To avoid this issue, we recommend increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node:

In /etc/sysconfig/network/dhcp, set

DHCLIENT_WAIT_AT_BOOT="30"

In /etc/sysconfig/network/config, set

WAIT_FOR_INTERFACES="60"

2.9.1 Adding a Private Network to a Running Cluster

If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improve when you set up a second private cluster network. To support two networks, each Ceph node needs to have at least two network cards.

You need to apply the following changes to each Ceph node. It is relatively quick to do for a small cluster, but can be very time consuming if you have a cluster consisting of hundreds or thousands of nodes.

  1. Stop Ceph related services on each cluster node.

    Add a line to /etc/ceph/ceph.conf to define the cluster network, for example:

    cluster network = 10.0.0.0/24

    If you need to specifically assign static IP addresses or override cluster network settings, you can do so with the optional cluster addr setting.

  2. Check that the private cluster network works as expected on the OS level.

  3. Start Ceph related services on each cluster node.

    root # systemctl start ceph.target

2.9.2 Monitor Nodes on Different Subnets

If the monitor nodes are on multiple subnets, for example they are located in different rooms and served by different switches, you need to adjust the ceph.conf file accordingly. For example, if the nodes have IP addresses 192.168.123.12, 1.2.3.4, and 242.12.33.12, add the following lines to the [global] section of ceph.conf:

[global]
[...]
mon host = 192.168.123.12, 1.2.3.4, 242.12.33.12
mon initial members = MON1, MON2, MON3
[...]

Additionally, if you need to specify a per-monitor public address or network, you need to add a [mon.X] section per each monitor:

[mon.MON1]
public network = 192.168.123.0/24

[mon.MON2]
public network = 1.2.3.0/24

[mon.MON3]
public network = 242.12.33.0/24

2.10 Naming Limitations

Ceph does not generally support non-ASCII characters in configuration files, pool names, user names and so forth. When configuring a Ceph cluster we recommend using only simple alphanumeric characters (A-Z, a-z, 0-9) and minimal punctuation ('.', '-', '_') in all Ceph object/configuration names.

2.11 OSD and Monitor Sharing One Server

Although it is technically possible to run Ceph OSDs and Monitors on the same server in test environments, we strongly recommend having a separate server for each monitor node in production. The main reason is performance—the more OSDs the cluster has, the more I/O operations the monitor nodes need to perform. And when one server is shared between a monitor node and OSD(s), the OSD I/O operations are a limiting factor for the monitor node.

Another consideration is whether to share disks between an OSD, a monitor node, and the operating system on the server. The answer is simple: if possible, dedicate a separate disk to OSD, and a separate server to a monitor node.

Although Ceph supports directory-based OSDs, an OSD should always have a dedicated disk other than the operating system one.

Tip
Tip

If it is really necessary to run OSD and monitor node on the same server, run the monitor on a separate disk by mounting the disk to the /var/lib/ceph/mon directory for slightly better performance.
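
For example, assuming an unused partition /dev/sdX1 (the device name is a placeholder), the monitor directory could be placed on its own file system as follows; add a corresponding entry to /etc/fstab to make the mount persistent:

root # mkfs.xfs /dev/sdX1
root # mount /dev/sdX1 /var/lib/ceph/mon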

2.12 Recommended Production Cluster Configuration

  • Seven Object Storage Nodes

    • No single node exceeds ~15% of total storage

    • 10 Gb Ethernet (four physical networks bonded to multiple switches)

    • 56+ OSDs per storage cluster

    • RAID 1 OS disks for each OSD storage node

    • SSDs for Journal with 6:1 ratio SSD journal to OSD

    • 1.5 GB of RAM per TB of raw OSD capacity for each Object Storage Node

    • 2 GHz per OSD for each Object Storage Node

  • Dedicated physical infrastructure nodes

    • Three Ceph Monitor nodes: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk

    • One SES management node: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk

    • Redundant physical deployment of gateway or Metadata Server nodes:

      • Object Gateway nodes: 32 GB RAM, 8 core processor, RAID 1 SSDs for disk

      • iSCSI Gateway nodes: 16 GB RAM, 4 core processor, RAID 1 SSDs for disk

      • Metadata Server nodes (one active/one hot standby): 32 GB RAM, 8 core processor, RAID 1 SSDs for disk

2.13 SUSE Enterprise Storage 6 and Other SUSE Products

This section contains important information about integrating SUSE Enterprise Storage 6 with other SUSE products.

2.13.1 SUSE Manager

SUSE Manager and SUSE Enterprise Storage are not integrated, therefore SUSE Manager cannot currently manage a SUSE Enterprise Storage cluster.

3 Admin Node HA Setup

Admin Node is a Ceph cluster node where the Salt master service is running. The admin node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services. It usually includes other services as well, for example the Ceph Dashboard Web UI with the Grafana dashboard backed by the Prometheus monitoring toolkit.

In case of an Admin Node failure, you usually need to provide new working hardware for the node and restore the complete cluster configuration stack from a recent backup. Such a method is time consuming and causes cluster outage.

To prevent Ceph cluster downtime caused by an Admin Node failure, we recommend making use of a High Availability (HA) cluster for the Ceph Admin Node.

3.1 Outline of the HA Cluster for Admin Node

The idea of an HA cluster is that in case one cluster node fails, the other node automatically takes over its role, including the virtualized Admin Node. This way, the other Ceph cluster nodes do not notice that the Admin Node failed.

The minimal HA solution for the Admin Node requires the following hardware:

  • Two bare metal servers able to run SUSE Linux Enterprise with the High Availability Extension and to virtualize the Admin Node.

  • Two or more redundant network communication paths, for example via Network Device Bonding.

  • Shared storage to host the disk image(s) of the Admin Node virtual machine. The shared storage needs to be accessible from both servers. It can be, for example, an NFS export, a Samba share, or an iSCSI target.

Find more details on the cluster requirements in https://www.suse.com/documentation/sle-ha-15/book_sleha_quickstarts/data/sec_ha_inst_quick_req.html.

2-Node HA Cluster for Admin Node
Figure 3.1: 2-Node HA Cluster for Admin Node

3.2 Building HA Cluster with Admin Node

The following procedure summarizes the most important steps of building the HA cluster for virtualizing the Admin Node. For details, refer to the indicated links.

  1. Set up a basic 2-node HA cluster with shared storage as described in https://www.suse.com/documentation/sle-ha-15/book_sleha_quickstarts/data/art_sleha_install_quick.html.

  2. On both cluster nodes, install all packages required for running the KVM hypervisor and the libvirt toolkit as described in https://www.suse.com/documentation/sles-15/book_virt/data/sec_vt_installation_kvm.html.

  3. On the first cluster node, create a new KVM virtual machine (VM) making use of libvirt as described in https://www.suse.com/documentation/sles-15/book_virt/data/sec_libvirt_inst_vmm.html. Use the preconfigured shared storage to store the disk images of the VM.

  4. After the VM setup is complete, export its configuration to an XML file on the shared storage. Use the following syntax:

    root # virsh dumpxml VM_NAME > /path/to/shared/vm_name.xml
  5. Create a resource for the Admin Node VM. Refer to https://www.suse.com/documentation/sle-ha-15/book_sleha_guide/data/cha_conf_hawk2.html for general information on creating HA resources. Detailed information on creating a resource for a KVM virtual machine is described in http://www.linux-ha.org/wiki/VirtualDomain_%28resource_agent%29. A minimal sketch of such a resource follows this procedure.

  6. On the newly created VM guest, deploy the Admin Node including the additional services you need there. Follow relevant steps in Section 4.3, “Cluster Deployment”. At the same time, deploy the remaining Ceph cluster nodes on the non-HA cluster servers.
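
A minimal sketch of a VirtualDomain resource for the Admin Node VM, assuming the XML file exported above and a hypothetical resource name, could look as follows:

root # crm configure primitive admin-node-vm ocf:heartbeat:VirtualDomain \
  params config="/path/to/shared/vm_name.xml" \
  op monitor interval=30s timeout=90s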

Part II Cluster Deployment and Upgrade

4 Deploying with DeepSea/Salt

Salt along with DeepSea is a stack of components that help you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:

  • Salt minions are the nodes controlled by a dedicated node called Salt master. Salt minions have roles, for example Ceph OSD, Ceph Monitor, Ceph Manager, Object Gateway, iSCSI Gateway, or NFS Ganesha.

  • A Salt master runs its own Salt minion. It is required for running privileged tasks—for example creating, authorizing, and copying keys to minions—so that remote minions never need to run privileged tasks.

    Tip
    Tip: Sharing Multiple Roles per Server

    You will get the best performance from your Ceph cluster when each role is deployed on a separate node. But real deployments sometimes require sharing one node for multiple roles. To avoid problems with performance and with the upgrade procedure, do not deploy the Ceph OSD, Metadata Server, or Ceph Monitor role to the Admin Node.

  • Salt minions need to correctly resolve the Salt master's host name over the network. By default, they look for the salt host name, but you can specify any other network-reachable host name in the /etc/salt/minion file, see Section 4.3, “Cluster Deployment”.

4.1 Read the Release Notes

In the release notes you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:

  • your hardware needs special considerations.

  • any used software packages have changed significantly.

  • special precautions are necessary for your installation.

The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.

After having installed the package release-notes-ses, find the release notes locally in the directory /usr/share/doc/release-notes or online at https://www.suse.com/releasenotes/.

4.2 Introduction to DeepSea

The goal of DeepSea is to save the administrator time and confidently perform complex operations on a Ceph cluster.

Ceph is a very configurable software solution. It increases both the freedom and responsibility of system administrators.

The minimal Ceph setup is good for demonstration purposes, but does not show the interesting features of Ceph that become visible with a large number of nodes.

DeepSea collects and stores data about individual servers, such as addresses and device names. For a distributed storage system such as Ceph, there can be hundreds of such items to collect and store. Collecting the information and entering the data manually into a configuration management tool is exhausting and error prone.

The steps necessary to prepare the servers, collect the configuration, and configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day to day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement.

DeepSea addresses these observations with the following strategy: DeepSea consolidates the administrator's decisions in a single file. The decisions include cluster assignment, role assignment and profile assignment. And DeepSea collects each set of tasks into a simple goal. Each goal is a stage:

DeepSea Stages Description
  • Stage 0—the preparation—during this stage, all required updates are applied and your system may be rebooted.

    Important
    Important: Re-run Stage 0 after the Admin Node Reboot

    If the Admin Node reboots during stage.0 to load the new kernel version, you need to run stage.0 again, otherwise minions will not be targeted.

  • Stage 1—the discovery—here all hardware in your cluster is being detected and necessary information for the Ceph configuration is being collected. For details about configuration, refer to Section 4.5, “Configuration and Customization”.

  • Stage 2—the configuration—you need to prepare configuration data in a particular format.

  • Stage 3—the deployment—creates a basic Ceph cluster with mandatory Ceph services. See Section 1.2.3, “Ceph Nodes and Daemons” for their list.

  • Stage 4—the services—additional features of Ceph like iSCSI, Object Gateway and CephFS can be installed in this stage. Each is optional.

  • Stage 5—the removal stage. This stage is not mandatory and during the initial setup it is usually not needed. In this stage the roles of minions and also the cluster configuration are removed. You need to run this stage when you need to remove a storage node from your cluster. For details refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.3 “Removing and Reinstalling Cluster Nodes”.

4.2.1 Organization and Important Locations

Salt has several standard locations and several naming conventions used on your master node:

/srv/pillar

The directory stores configuration data for your cluster minions. Pillar is an interface for providing global configuration values to all your cluster minions.

/srv/salt/

The directory stores Salt state files (also called sls files). State files are formatted descriptions of states in which the cluster should be.

/srv/module/runners

The directory stores Python scripts known as runners. Runners are executed on the master node.

/srv/salt/_modules

The directory stores Python scripts that are called modules. The modules are applied to all minions in your cluster.

/srv/pillar/ceph

The directory is used by DeepSea. Collected configuration data are stored here.

/srv/salt/ceph

A directory used by DeepSea. It stores sls files that can be in different formats, but each subdirectory contains sls files. Each subdirectory contains only one type of sls file. For example, /srv/salt/ceph/stage contains orchestration files that are executed by salt-run state.orchestrate.

4.2.2 Targeting the Minions

DeepSea commands are executed via the Salt infrastructure. When using the salt command, you need to specify a set of Salt minions that the command will affect. We describe the set of the minions as a target for the salt command. The following sections describe possible methods to target the minions.

4.2.2.1 Matching the Minion Name

You can target a minion or a group of minions by matching their names. A minion's name is usually the short host name of the node where the minion runs. This is a general Salt targeting method, not related to DeepSea. You can use globbing, regular expressions, or lists to limit the range of minion names. The general syntax follows:

root@master # salt target example.module
Tip
Tip: Ceph-only Cluster

If all Salt minions in your environment belong to your Ceph cluster, you can safely substitute target with '*' to include all registered minions.

Match all minions in the example.net domain (assuming the minion names are identical to their "full" host names):

root@master # salt '*.example.net' test.ping

Match the 'web1' to 'web5' minions:

root@master # salt 'web[1-5]' test.ping

Match both 'web1-prod' and 'web1-devel' minions using a regular expression:

root@master # salt -E 'web1-(prod|devel)' test.ping

Match a simple list of minions:

root@master # salt -L 'web1,web2,web3' test.ping

Match all minions in the cluster:

root@master # salt '*' test.ping

4.2.2.2 Targeting with a 'deepsea' Grain

In a heterogeneous Salt-managed environment where SUSE Enterprise Storage 6 is deployed on a subset of nodes alongside other cluster solutions, you need to mark the relevant minions by applying a 'deepsea' grain to them before running DeepSea stage.0. This way you can easily target DeepSea minions in environments where matching by the minion name is problematic.

To apply the 'deepsea' grain to a group of minions, run:

root@master # salt target grains.append deepsea default

To remove the 'deepsea' grain from a group of minions, run:

root@master # salt target grains.delval deepsea destructive=True

After applying the 'deepsea' grain to the relevant minions, you can target them as follows:

root@master # salt -G 'deepsea:*' test.ping

The following command is an equivalent:

root@master # salt -C 'G@deepsea:*' test.ping

4.2.2.3 Set the deepsea_minions Option

Setting the deepsea_minions option's target is a requirement for DeepSea deployments. DeepSea uses it to instruct minions during stage execution (refer to DeepSea Stages Description for details).

To set or change the deepsea_minions option, edit the /srv/pillar/ceph/deepsea_minions.sls file on the Salt master and add or replace the following line:

deepsea_minions: target
Tip
Tip: deepsea_minions Target

As the target for the deepsea_minions option, you can use any targeting method: both Matching the Minion Name and Targeting with a 'deepsea' Grain.

Match all Salt minions in the cluster:

deepsea_minions: '*'

Match all minions with the 'deepsea' grain:

deepsea_minions: 'G@deepsea:*'

4.2.2.4 For More Information

You can use more advanced ways to target minions using the Salt infrastructure. The deepsea_minions manual page gives you more details about DeepSea targeting (man 7 deepsea_minions).

4.3 Cluster Deployment

The cluster deployment process has several phases. First, you need to prepare all nodes of the cluster by configuring Salt and then deploy and configure Ceph.

Tip
Tip: Deploying Monitor Nodes without Defining OSD Profiles

If you need to skip defining storage roles for OSD as described in Section 4.5.1.2, “Role Assignment” and deploy Ceph Monitor nodes first, you can do so by setting the DEV_ENV variable.

It allows deploying monitors without the presence of the role-storage/ directory, as well as deploying a Ceph cluster with at least one storage, monitor, and manager role.

To set the environment variable, either enable it globally by setting it in the /srv/pillar/ceph/stack/global.yml file, or set it for the current shell session only:

root@master # export DEV_ENV=true

As an example, /srv/pillar/ceph/stack/global.yml can be created with the following contents:

DEV_ENV: True

The following procedure describes the cluster preparation in detail.

  1. Install and register SUSE Linux Enterprise Server 15 SP1 together with the SUSE Enterprise Storage 6 extension on each node of the cluster.

  2. Verify that proper products are installed and registered by listing existing software repositories. Run zypper lr -E and compare the output with the following list:

     SLE-Product-SLES15-SP1-Pool
     SLE-Product-SLES15-SP1-Updates
     SLE-Module-Server-Applications15-SP1-Pool
     SLE-Module-Server-Applications15-SP1-Updates
     SLE-Module-Basesystem15-SP1-Pool
     SLE-Module-Basesystem15-SP1-Updates
     SUSE-Enterprise-Storage-6-Pool
     SUSE-Enterprise-Storage-6-Updates
  3. Configure network settings including proper DNS name resolution on each node. The Salt master and all the Salt minions need to resolve each other by their host names. For more information on configuring a network, see https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_network_yast.html. For more information on configuring a DNS server, see https://www.suse.com/documentation/sles-15/book_sle_admin/data/cha_dns.html.

  4. Select one or more time servers/pools, and synchronize the local time against them. Verify that the time synchronization service is enabled on each system start-up. You can use the yast ntp-client command found in the yast2-ntp-client package to configure time synchronization.

    Tip
    Tip

    Virtual machines are not reliable NTP sources.

    Find more information on setting up NTP in https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_ntp_yast.html.

  5. Install the salt-master and salt-minion packages on the Salt master node:

    root@master # zypper in salt-master salt-minion

    Check that the salt-master service is enabled and started, and enable and start it if needed:

    root@master # systemctl enable salt-master.service
    root@master # systemctl start salt-master.service
  6. If you intend to use a firewall, verify that the Salt master node has ports 4505 and 4506 open to all Salt minion nodes. If the ports are closed, you can open them using the yast2 firewall command by allowing the SaltStack service.

    Warning
    Warning: DeepSea Stages Fail with Firewall

    DeepSea deployment stages fail when the firewall is active (or even merely configured). To pass the stages correctly, you need to either turn the firewall off by running

        root # systemctl stop firewalld.service

    or set the FAIL_ON_WARNING option to 'False' in /srv/pillar/ceph/stack/global.yml:

    FAIL_ON_WARNING: False
  7. Install the package salt-minion on all minion nodes.

    root # zypper in salt-minion

    Make sure that the fully qualified domain name of each node can be resolved to the public network IP address by all other nodes.

  8. Configure all minions (including the master minion) to connect to the master. If your Salt master is not reachable by the host name salt, edit the file /etc/salt/minion or create a new file /etc/salt/minion.d/master.conf with the following content:

    master: host_name_of_salt_master

    If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:

    root@minion > systemctl restart salt-minion.service
  9. Check that the salt-minion service is enabled and started on all nodes. Enable and start it if needed:

    root # systemctl enable salt-minion.service
    root # systemctl start salt-minion.service
  10. Verify each Salt minion's fingerprint and accept all Salt keys on the Salt master if the fingerprints match.

    Note
    Note

    If the Salt minion fingerprint comes back empty, make sure the Salt minion has a Salt master configuration and it can communicate with the Salt master.

    View each minion's fingerprint:

    root@minion > salt-call --local key.finger
    local:
    3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

    After gathering fingerprints of all the Salt minions, list fingerprints of all unaccepted minion keys on the Salt master:

    root@master # salt-key -F
    [...]
    Unaccepted Keys:
    minion1:
    3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

    If the minions' fingerprints match, accept them:

    root@master # salt-key --accept-all
  11. Verify that the keys have been accepted:

    root@master # salt-key --list-all
  12. Prior to deploying SUSE Enterprise Storage 6, manually zap all the disks. Remember to replace 'X' with the correct disk letter:

    1. Stop all processes that are using the specific disk.

    2. Verify whether any partition on the disk is mounted, and unmount if needed.

    3. If the disk is managed by LVM, deactivate and delete the whole LVM infrastructure. Refer to https://www.suse.com/documentation/sles-15/book_storage/data/cha_lvm.html for more details.

    4. If the disk is part of MD RAID, deactivate the RAID. Refer to https://www.suse.com/documentation/sles-15/book_storage/data/part_software_raid.html for more details.

    5. Tip
      Tip: Rebooting the Server

      If you get error messages such as 'partition in use' or 'kernel can not be updated with the new partition table' during the following steps, reboot the server.

      Wipe the beginning of each partition (as root):

      for partition in /dev/sdX[0-9]*
      do
        dd if=/dev/zero of=$partition bs=4096 count=1 oflag=direct
      done
    6. Wipe the beginning of the drive:

      root # dd if=/dev/zero of=/dev/sdX bs=512 count=34 oflag=direct
    7. Wipe the end of the drive:

      root # dd if=/dev/zero of=/dev/sdX bs=512 count=33 \
        seek=$((`blockdev --getsz /dev/sdX` - 33)) oflag=direct
    8. Verify that the drive is empty (no GPT structures remain) using:

      root # parted -s /dev/sdX print free

      or

      root # dd if=/dev/sdX bs=512 count=34 | hexdump -C
      root # dd if=/dev/sdX bs=512 count=33 \
        skip=$((`blockdev --getsz /dev/sdX` - 33)) | hexdump -C
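
      Alternatively, the following is a sketch of a shorter wipe using wipefs (from util-linux) and sgdisk (from the gdisk package), assuming both tools are installed; it does not replace the LVM and MD RAID cleanup steps above if the disk was used by them:

      root # wipefs --all /dev/sdX
      root # sgdisk --zap-all /dev/sdX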
  13. Optionally, if you need to preconfigure the cluster's network settings before the deepsea package is installed, create /srv/pillar/ceph/stack/ceph/cluster.yml manually and set the cluster_network: and public_network: options (a minimal example follows the tip below). Note that the file will not be overwritten when you later install the deepsea package.

    Tip
    Tip: Enabling IPv6

    If you need to enable IPv6 network addressing, refer to Section 6.2.1, “Enabling IPv6 for Ceph Cluster Deployment”.
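
    A minimal example of such a manually created cluster.yml follows; the networks shown are placeholders and need to be replaced with your actual cluster and public networks:

    cluster_network: 192.168.100.0/24
    public_network: 192.168.1.0/24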

  14. Install DeepSea on the Salt master node:

    root@master # zypper in deepsea
  15. The value of the master_minion parameter is dynamically derived from the /etc/salt/minion_id file on the Salt master. If you need to override the discovered value, edit the file /srv/pillar/ceph/stack/global.yml and set a relevant value:

    master_minion: MASTER_MINION_NAME

    If your Salt master is reachable via more than one host name, use the Salt minion name for the storage cluster as returned by the salt-key -L command. If you used the default host name for your Salt master—salt—in the ses domain, then the file looks as follows:

    master_minion: salt.ses

Now you deploy and configure Ceph. Unless specified otherwise, all steps are mandatory.

Note
Note: Salt Command Conventions

There are two ways to run salt-run state.orch: one uses 'stage.STAGE_NUMBER', the other uses the name of the stage. Both notations have the same effect; which one you use is entirely a matter of preference.

Procedure 4.1: Running Deployment Stages
  1. Ensure the Salt minions belonging to the Ceph cluster are correctly targeted through the deepsea_minions option in /srv/pillar/ceph/deepsea_minions.sls. Refer to Section 4.2.2.3, “Set the deepsea_minions Option” for more information.
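
    For example, a minimal /srv/pillar/ceph/deepsea_minions.sls that targets all Salt minions looks as follows:

    deepsea_minions: '*'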

  2. By default, DeepSea deploys Ceph clusters with tuned profiles active on Ceph Monitor, Ceph Manager, and Ceph OSD nodes. In some cases, you may need to deploy without tuned profiles. To do so, put the following lines in /srv/pillar/ceph/stack/global.yml before running DeepSea stages:

    alternative_defaults:
     tuned_mgr_init: default-off
     tuned_mon_init: default-off
     tuned_osd_init: default-off
  3. Optional: create Btrfs sub-volumes for /var/lib/ceph/. This step needs to be executed before running the DeepSea stages. To migrate existing directories or for more details, see Book “Administration Guide”, Chapter 23 “Hints and Tips”, Section 23.6 “Btrfs Subvolume for /var/lib/ceph on Ceph Monitor Nodes”.

    root@master # salt 'MON_NODES' state.apply ceph.subvolume
  4. Prepare your cluster. Refer to DeepSea Stages Description for more details.

    root@master # salt-run state.orch ceph.stage.0

    or

    root@master # salt-run state.orch ceph.stage.prep
    Note
    Note: Run or Monitor Stages using DeepSea CLI

    Using the DeepSea CLI, you can follow the stage execution progress in real-time, either by running the DeepSea CLI in the monitoring mode, or by running the stage directly through DeepSea CLI. For details refer to Section 4.4, “DeepSea CLI”.

  5. The discovery stage collects data from all minions and creates configuration fragments that are stored in the directory /srv/pillar/ceph/proposals. The data are stored in the YAML format in *.sls or *.yml files.

    Run the following command to trigger the discovery stage:

    root@master # salt-run state.orch ceph.stage.1

    or

    root@master # salt-run state.orch ceph.stage.discovery
  6. After the previous command finishes successfully, create a policy.cfg file in /srv/pillar/ceph/proposals. For details refer to Section 4.5.1, “The policy.cfg File”.

    Tip

    If you need to change the cluster's network setting, edit /srv/pillar/ceph/stack/ceph/cluster.yml and adjust the lines starting with cluster_network: and public_network:.

  7. The configuration stage parses the policy.cfg file and merges the included files into their final form. Cluster- and role-related content is placed in /srv/pillar/ceph/cluster, while Ceph-specific content is placed in /srv/pillar/ceph/stack/default.

    Run the following command to trigger the configuration stage:

    root@master # salt-run state.orch ceph.stage.2

    or

    root@master # salt-run state.orch ceph.stage.configure

    The configuration step may take several seconds. After the command finishes, you can view the pillar data for the specified minions (for example, named ceph_minion1, ceph_minion2, etc.) by running:

    root@master # salt 'ceph_minion*' pillar.items
    Tip
    Tip: Modifying OSD's Layout

    If you want to modify the default OSD's layout and change the drive groups configuration, follow the procedure described in Section 4.5.2, “DriveGroups”.

    Note
    Note: Overwriting Defaults

    As soon as the command finishes, you can view the default configuration and change it to suit your needs. For details refer to Chapter 6, Customizing the Default Configuration.

  8. Now you run the deployment stage. In this stage, the pillar is validated, and the Ceph Monitor and Ceph OSD daemons are started:

    root@master # salt-run state.orch ceph.stage.3

    or

    root@master # salt-run state.orch ceph.stage.deploy

    The command may take several minutes. If it fails, you need to fix the issue and run the previous stages again. After the command succeeds, run the following to check the status:

    cephadm > ceph -s
  9. The last step of the Ceph cluster deployment is the services stage. Here you instantiate any of the currently supported services: iSCSI Gateway, CephFS, Object Gateway, and NFS Ganesha. In this stage, the necessary pools and authorization keyrings are created, and the related services are started. To start the stage, run the following:

    root@master # salt-run state.orch ceph.stage.4

    or

    root@master # salt-run state.orch ceph.stage.services

    Depending on the setup, the command may run for several minutes.

  10. Before you continue, we strongly recommend enabling the Ceph telemetry module. See Book “Administration Guide”, Chapter 8 “Ceph Manager Modules”, Section 8.2 “Telemetry Module” for information and instructions.

4.4 DeepSea CLI

DeepSea also provides a command line interface (CLI) tool that allows the user to monitor or run stages while visualizing the execution progress in real-time. Verify that the deepsea-cli package is installed before you run the deepsea executable.
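
If the package is not installed yet, you can add it, for example:

root@master # zypper in deepsea-cli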

Two modes are supported for visualizing a stage's execution progress:

DeepSea CLI Modes
  • Monitoring mode: visualizes the execution progress of a DeepSea stage triggered by the salt-run command issued in another terminal session.

  • Stand-alone mode: runs a DeepSea stage while providing real-time visualization of its component steps as they are executed.

Important
Important: DeepSea CLI Commands

The DeepSea CLI commands can only be run on the Salt master node with root privileges.

4.4.1 DeepSea CLI: Monitor Mode

The progress monitor provides a detailed, real-time visualization of what is happening during execution of stages using salt-run state.orch commands in other terminal sessions.

Tip
Tip: Start Monitor in a New Terminal Session

You need to start the monitor in a new terminal window before running any salt-run state.orch command so that the monitor can detect the start of the stage's execution.

If you start the monitor after issuing the salt-run state.orch command, then no execution progress will be shown.

You can start the monitor mode by running the following command:

root@master # deepsea monitor

For more information about the available command line options of the deepsea monitor command, check its manual page:

cephadm > man deepsea-monitor

4.4.2 DeepSea CLI: Stand-alone Mode

In the stand-alone mode, DeepSea CLI can be used to run a DeepSea stage, showing its execution in real-time.

The command to run a DeepSea stage from the DeepSea CLI has the following form:

root@master # deepsea stage run stage-name

where stage-name corresponds to the way Salt orchestration state files are referenced. For example, stage deploy, which corresponds to the directory located in /srv/salt/ceph/stage/deploy, is referenced as ceph.stage.deploy.

This command is an alternative to the Salt-based commands for running DeepSea stages (or any DeepSea orchestration state file).

The command deepsea stage run ceph.stage.0 is equivalent to salt-run state.orch ceph.stage.0.

For more information about the available command line options accepted by the deepsea stage run command, check its manual page:

root@master # man deepsea-stage run

The following figure shows an example of the output of the DeepSea CLI when running Stage 2:

Figure 4.1: DeepSea CLI stage execution progress output

4.4.2.1 DeepSea CLI stage run Alias

For advanced Salt users, we also support an alias for running a DeepSea stage that takes the Salt command used to run a stage (for example, salt-run state.orch stage-name) as a command of the DeepSea CLI.

Example:

root@master # deepsea salt-run state.orch stage-name

4.5 Configuration and Customization

4.5.1 The policy.cfg File

The /srv/pillar/ceph/proposals/policy.cfg configuration file is used to determine the roles of individual cluster nodes, for example which nodes act as Ceph OSDs or Ceph Monitors. Edit policy.cfg to reflect your desired cluster setup. The order of the sections is arbitrary, but the content of included lines overwrites matching keys from the content of previous lines.

Tip
Tip: Examples of policy.cfg

You can find several examples of complete policy files in the /usr/share/doc/packages/deepsea/examples/ directory.

4.5.1.1 Cluster Assignment

In the cluster section you select minions for your cluster. You can select all minions, or you can blacklist or whitelist minions. Examples for a cluster called ceph follow.

To include all minions, add the following lines:

cluster-ceph/cluster/*.sls

To whitelist a particular minion:

cluster-ceph/cluster/abc.domain.sls

or a group of minions (you can use shell glob matching):

cluster-ceph/cluster/mon*.sls

To blacklist minions, set them to unassigned:

cluster-unassigned/cluster/client*.sls

4.5.1.2 Role Assignment

This section provides you with details on assigning 'roles' to your cluster nodes. A 'role' in this context means the service you need to run on the node, such as Ceph Monitor, Object Gateway, or iSCSI Gateway. No role is assigned automatically; only roles added to policy.cfg will be deployed.

The assignment follows this pattern:

role-ROLE_NAME/PATH/FILES_TO_INCLUDE

Where the items have the following meaning and values:

  • ROLE_NAME is any of the following: 'master', 'admin', 'mon', 'mgr', 'storage', 'mds', 'igw', 'rgw', 'ganesha', 'grafana', or 'prometheus'.

  • PATH is a relative directory path to .sls or .yml files. In the case of .sls files, it usually is cluster, while .yml files are located at stack/default/ceph/minions.

  • FILES_TO_INCLUDE are the Salt state files or YAML configuration files. They normally consist of Salt minion host names, for example ses5min2.yml. Shell globbing can be used for more specific matching.

An example for each role follows:

  • master - the node has admin keyrings to all Ceph clusters. Currently, only a single Ceph cluster is supported. As the master role is mandatory, always add a line similar to the following:

    role-master/cluster/master*.sls
  • admin - the minion will have an admin keyring. You define the role as follows:

    role-admin/cluster/abc*.sls
  • mon - the minion will provide the monitor service to the Ceph cluster. This role requires addresses of the assigned minions. Since SUSE Enterprise Storage 5, the public addresses are calculated dynamically and are no longer needed in the Salt pillar.

    role-mon/cluster/mon*.sls

    The example assigns the monitor role to a group of minions.

  • mgr - the Ceph manager daemon which collects all the state information from the whole cluster. Deploy it on all minions where you plan to deploy the Ceph monitor role.

    role-mgr/cluster/mgr*.sls
  • storage - use this role to specify storage nodes.

    role-storage/cluster/data*.sls
  • mds - the minion will provide the metadata service to support CephFS.

    role-mds/cluster/mds*.sls
  • igw - the minion will act as an iSCSI Gateway. This role requires addresses of the assigned minions; thus you also need to include the files from the stack directory:

    role-igw/cluster/*.sls
  • rgw - the minion will act as an Object Gateway:

    role-rgw/cluster/rgw*.sls
  • ganesha - the minion will act as an NFS Ganesha server. The 'ganesha' role requires either an 'rgw' or 'mds' role in the cluster, otherwise the validation will fail in Stage 3.

    role-ganesha/cluster/ganesha*.sls

    To successfully install NFS Ganesha, additional configuration is required. If you want to use NFS Ganesha, read Chapter 11, Installation of NFS Ganesha before executing stages 2 and 4. However, it is possible to install NFS Ganesha later.

    In some cases it can be useful to define custom roles for NFS Ganesha nodes. For details, see Book “Administration Guide”, Chapter 19 “NFS Ganesha: Export Ceph Data via NFS”, Section 19.3 “Custom NFS Ganesha Roles”.

  • grafana, prometheus - this node adds Grafana charts based on Prometheus alerting to the Ceph Dashboard. Refer to Book “Administration Guide”, Chapter 20 “Ceph Dashboard” for its detailed description.

    role-grafana/cluster/grafana*.sls
    role-prometheus/cluster/prometheus*.sls
Note
Note: Multiple Roles of Cluster Nodes

You can assign several roles to a single node. For example, you can assign the 'mds' role to the monitor nodes:

role-mds/cluster/mon[1,2]*.sls

4.5.1.3 Common Configuration

The common configuration section includes configuration files generated during the discovery (Stage 1). These configuration files store parameters like fsid or public_network. To include the required Ceph common configuration, add the following lines:

config/stack/default/global.yml
config/stack/default/ceph/cluster.yml

4.5.1.4 Item Filtering

Sometimes it is not practical to include all files from a given directory with *.sls globbing. The policy.cfg file parser understands the following filters:

Warning
Warning: Advanced Techniques

This section describes filtering techniques for advanced users. When not used correctly, filtering can cause problems, for example when your node numbering changes.

slice=[start:end]

Use the slice filter to include only items start through end-1. Note that items in the given directory are sorted alphanumerically. The following line includes the third to fifth files from the role-mon/cluster/ subdirectory:

role-mon/cluster/*.sls slice[3:6]
re=regexp

Use the regular expression filter to include only items matching the given expressions. For example:

role-mon/cluster/mon*.sls re=.*1[135]\.subdomainX\.sls$

4.5.1.5 Example policy.cfg File

Following is an example of a basic policy.cfg file:

## Cluster Assignment
cluster-ceph/cluster/*.sls 1

## Roles
# ADMIN
role-master/cluster/examplesesadmin.sls 2
role-admin/cluster/sesclient*.sls 3

# MON
role-mon/cluster/ses-example-[123].sls 4

# MGR
role-mgr/cluster/ses-example-[123].sls 5

# STORAGE
role-storage/cluster/ses-example-[5,6,7,8].sls 6

# MDS
role-mds/cluster/ses-example-4.sls 7

# IGW
role-igw/cluster/ses-example-4.sls 8

# RGW
role-rgw/cluster/ses-example-4.sls 9

# COMMON
config/stack/default/global.yml 10
config/stack/default/ceph/cluster.yml 11

1

Indicates that all minions are included in the Ceph cluster. If you have minions you do not want to include in the Ceph cluster, use:

cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/ses-example-*.sls

The first line marks all minions as unassigned. The second line overrides minions matching 'ses-example-*.sls', and assigns them to the Ceph cluster.

2

The minion called 'examplesesadmin' has the 'master' role. This also means it will get admin keys to the cluster.

3

All minions matching 'sesclient*' will get admin keys as well.

4

All minions matching 'ses-example-[123]' (presumably three minions: ses-example-1, ses-example-2, and ses-example-3) will be set up as MON nodes.

5

All minions matching 'ses-example-[123]' (all MON nodes in the example) will be set up as MGR nodes.

6

All minions matching 'ses-example-[5,6,7,8]' will be set up as storage nodes.

7

Minion 'ses-example-4' will have the MDS role.

8

Minion 'ses-example-4' will have the IGW role.

9

Minion 'ses-example-4' will have the RGW role.

10

Means that we accept the default values for common configuration parameters such as fsid and public_network.

11

Means that we accept the default values for common configuration parameters such as fsid and public_network.

4.5.2 DriveGroups

DriveGroups specify the layouts of OSDs in the Ceph cluster. They are defined in a single file: /srv/salt/ceph/configuration/files/drive_groups.yml.

An administrator should manually specify a group of OSDs that are interrelated (hybrid OSDs that are deployed on solid-state drives and spinners) or share the same deployment options (for example the same object store, the same encryption option, or stand-alone OSDs). To avoid explicitly listing devices, DriveGroups use a list of filter items that correspond to a few selected fields of ceph-volume's inventory reports. In the simplest case this could be the 'rotational' flag (all solid-state drives are to be db_devices, all rotating ones data devices), or something more involved such as 'model' strings or sizes. DeepSea provides code that translates these DriveGroups into actual device lists for inspection by the user.

Following is a simple procedure that demonstrates the basic workflow when configuring DriveGroups:

  1. Inspect your disks' properties as seen by the ceph-volume command. Only these properties are accepted by DriveGroups:

    root@master # salt-run disks.details
  2. Open the /srv/salt/ceph/configuration/files/drive_groups.yml YAML file and adjust it to your needs. Refer to Section 4.5.2.1, “Specification”. Remember to use spaces instead of tabs. Find more advanced examples in Section 4.5.2.4, “Examples”. The following example includes all drives available to Ceph as OSDs:

    default_drive_group_name:
      target: '*'
      data_devices:
        all: true
  3. Verify new layouts:

    root@master # salt-run disks.list

    This runner returns you a structure of matching disks based on your drive groups. If you are not happy with the result, repeat the previous step.

    Tip
    Tip: Detailed Report

    In addition to the disks.list runner, there is a disks.report runner that prints out a detailed report of what will happen in the next DeepSea stage.3 invocation.

    root@master # salt-run disks.report
  4. Deploy OSDs. On the next DeepSea stage.3 invocation, the OSD disks will be deployed according to your drive group specification.

4.5.2.1 Specification

/srv/salt/ceph/configuration/files/drive_groups.yml accepts the following options:

drive_group_default_name:
  target: '*'
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  db_devices:
    drive_spec: DEVICE_SPECIFICATION
  wal_devices:
    drive_spec: DEVICE_SPECIFICATION
  block_wal_size: '5G'  # (optional, unit suffixes permitted)
  block_db_size: '5G'   # (optional, unit suffixes permitted)
  osds_per_device: 1   # number of osd daemons per device
  format:              # 'bluestore' or 'filestore' (defaults to 'bluestore')
  encryption:           # 'True' or 'False' (defaults to 'False')

For FileStore setups, drive_groups.yml can be as follows:

drive_group_default_name:
  target: '*'
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  journal_devices:
    drive_spec: DEVICE_SPECIFICATION
  format: filestore
  encryption: True

4.5.2.2 Matching Disk Devices

You can describe the specification using the following filters (a combined sketch follows the list):

  • By a disk model:

    model: DISK_MODEL_STRING
  • By a disk vendor:

    vendor: DISK_VENDOR_STRING
    Tip
    Tip: Lowercase Vendor String

    Always lowercase the DISK_VENDOR_STRING.

  • Whether a disk is rotational or not. SSDs and NVMes are not rotational.

    rotational: 0
  • Deploy a node using all available drives for OSDs:

    data_devices:
      all: true
  • Additionally, by limiting the number of matching disks:

    limit: 10
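
As a sketch, several of these filters can be combined in one DriveGroup; the following hypothetical example uses up to ten rotating disks as data devices and non-rotating Intel devices for the DB:

drive_group_example:
  target: '*'
  data_devices:
    rotational: 1
    limit: 10
  db_devices:
    vendor: intel
    rotational: 0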

4.5.2.3 Filtering Devices by Size

You can filter disk devices by their size—either by an exact size, or a size range. The size: parameter accepts arguments in the following form:

  • '10G' - Includes disks of an exact size.

  • '10G:40G' - Includes disks whose size is within the range.

  • ':10G' - Includes disks less than or equal to 10G in size.

  • '40G:' - Includes disks equal to or greater than 40G in size.

Example 4.1: Matching by Disk Size
drive_group_default:
  target: '*'
  data_devices:
    size: '40TB:'
  db_devices:
    size: ':2TB'
Note
Note: Quotes Required

When using the ':' delimiter, you need to enclose the size in quotes, otherwise the ':' sign will be interpreted as a new configuration hash.

Tip
Tip: Units Shortcuts

Instead of (G)igabytes, you can specify the sizes in (M)egabytes or (T)erabytes as well.

4.5.2.4 Examples

This section includes examples of different OSD setups.

Example 4.2: Simple Setup

This example describes two nodes with the same setup:

  • 20 HDDs

    • Vendor: Intel

    • Model: SSD-123-foo

    • Size: 4TB

  • 2 SSDs

    • Vendor: Micron

    • Model: MC-55-44-ZX

    • Size: 512GB

The corresponding drive_groups.yml file will be as follows:

drive_group_default:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: MC-55-44-XZ

Such a configuration is simple and valid. The problem is that an administrator may add disks of different vendors in the future, and these will not be included. You can improve it by reducing the filters to core properties of the drives:

drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

In the previous example, we enforce that all rotating devices are declared as 'data devices', while all non-rotating devices will be used as 'shared devices' (wal, db).

If you know that drives larger than 2 TB will always be the slower data devices, you can filter by size:

drive_group_default:
  target: '*'
  data_devices:
    size: '2TB:'
  db_devices:
    size: ':2TB'
Example 4.3: Advanced Setup

This example describes two distinct setups: 20 HDDs should share 2 SSDs, while 10 SSDs should share 2 NVMes.

  • 20 HDDs

    • Vendor: Intel

    • Model: SSD-123-foo

    • Size: 4TB

  • 12 SSDs

    • Vendor: Micron

    • Model: MC-55-44-ZX

    • Size: 512GB

  • 2 NVMEs

    • Vendor: Samsung

    • Model: NVME-QQQQ-987

    • Size: 256GB

Such setup can be defined with two layouts as follows:

drive_group:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-XZ
drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-XZ
  db_devices:
    vendor: samsung
    size: 256GB
Example 4.4: Advanced Setup with Non-uniform Nodes

The previous examples assumed that all nodes have the same drives. However, that is not always the case:

Node 1-5:

  • 20 HDDs

    • Vendor: Intel

    • Model: SSD-123-foo

    • Size: 4TB

  • 2 SSDs

    • Vendor: Micron

    • Model: MC-55-44-ZX

    • Size: 512GB

Node 6-10:

  • 5 NVMEs

    • Vendor: Intel

    • Model: SSD-123-foo

    • Size: 4TB

  • 20 SSDs

    • Vendor: Micron

    • Model: MC-55-44-ZX

    • Size: 512GB

You can use the 'target' key in the layout to target specific nodes. Salt target notation helps to keep things simple:

drive_group_node_one_to_five:
  target: 'node[1-5]'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

followed by

drive_group_the_rest:
  target: 'node[6-10]'
  data_devices:
    model: MC-55-44-XZ
  db_devices:
    model: SSD-123-foo
Example 4.5: Expert Setup

All previous cases assumed that the WALs and DBs use the same device. It is however possible to deploy the WAL on a dedicated device as well:

  • 20 HDDs

    • Vendor: Intel

    • Model: SSD-123-foo

    • Size: 4TB

  • 2 SSDs

    • Vendor: Micron

    • Model: MC-55-44-ZX

    • Size: 512GB

  • 2 NVMEs

    • Vendor: Samsung

    • Model: NVME-QQQQ-987

    • Size: 256GB

drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-XZ
  db_devices:
    model: SSD-123-foo
  wal_devices:
    model: NVME-QQQQ-987
Example 4.6: Complex (and Unlikely) Setup

In the following setup, we are trying to define:

  • 20 HDDs backed by 1 NVME

  • 2 HDDs backed by 1 SSD(db) and 1 NVME(wal)

  • 8 SSDs backed by 1 NVME

  • 2 SSDs stand-alone (encrypted)

  • 1 HDD is spare and should not be deployed

The summary of used drives follows:

  • 23 HDDs

    • Vendor: Intel

    • Model: SSD-123-foo

    • Size: 4TB

  • 10 SSDs

    • Vendor: Micron

    • Model: MC-55-44-ZX

    • Size: 512GB

  • 1 NVMEs

    • Vendor: Samsung

    • Model: NVME-QQQQ-987

    • Size: 256GB

The DriveGroups definition will be the following:

drive_group_hdd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: NVME-QQQQ-987
drive_group_hdd_ssd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-XZ
  wal_devices:
    model: NVME-QQQQ-987
drive_group_ssd_nvme:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: NVME-QQQQ-987
drive_group_ssd_standalone_encrypted:
  target: '*'
  data_devices:
    model: SSD-123-foo
  encryption: True

One HDD will remain unassigned, because the file is parsed from top to bottom.

4.5.3 Adjusting ceph.conf with Custom Settings

If you need to put custom settings into the ceph.conf configuration file, see Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.13 “Adjusting ceph.conf with Custom Settings” for more details.

5 Upgrading from Previous Releases

This chapter introduces steps to upgrade SUSE Enterprise Storage 5.5 to version 6. Note that version 5.5 is basically version 5 with all the latest patches applied.

Note
Note: Upgrade from Older Releases not Supported

Upgrading from SUSE Enterprise Storage versions older than 5.5 is not supported. You first need to upgrade to the latest version of SUSE Enterprise Storage 5.5 and then follow the steps in this chapter.

5.1 Points to Consider before the Upgrade

  • Read the release notes - there you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:

    • your hardware needs special considerations.

    • any used software packages have changed significantly.

    • special precautions are necessary for your installation.

    The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.

    After having installed the package release-notes-ses, find the release notes locally in the directory /usr/share/doc/release-notes or online at https://www.suse.com/releasenotes/.

  • In case you previously upgraded from version 4, verify that the upgrade to version 5 was completed successfully:

    • Check for the existence of the file

      /srv/salt/ceph/configuration/files/ceph.conf.import

      It is created by the engulf process during the upgrade from SES 4 to 5. Also, the configuration_init: default-import option is set in the file

      /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml

      If configuration_init is still set to default-import, the cluster is using ceph.conf.import as its configuration file and not DeepSea's default ceph.conf, which is compiled from files in

      /srv/salt/ceph/configuration/files/ceph.conf.d/

      Therefore you need to inspect ceph.conf.import for any custom configuration, and possibly move the configuration to one of the files in

      /srv/salt/ceph/configuration/files/ceph.conf.d/

      Then remove the configuration_init: default-import line from

      /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml
      Warning
      Warning: Default DeepSea Configuration

      If you do not merge the configuration from ceph.conf.import and remove the configuration_init: default-import option, any default configuration settings we ship as part of DeepSea (stored in /srv/salt/ceph/configuration/files/ceph.conf.j2) will not be applied to the cluster.

    • Check if the cluster uses the new bucket type 'straw2':

      cephadm > ceph osd crush dump | grep straw
    • Check that Ceph 'jewel' profile is used:

      cephadm > ceph osd crush dump | grep profile
  • In case old RBD kernel clients (older than SUSE Linux Enterprise Server 12 SP3) are being used, refer to Book “Administration Guide”, Chapter 10 “RADOS Block Device”, Section 10.9 “Mapping RBD using Old Kernel Clients”. We recommend upgrading old RBD kernel clients if possible.

  • If openATTIC is located on the Admin Node, it will be unavailable after you upgrade the node. The new Ceph Dashboard will not be available until you deploy it by using DeepSea.

  • The cluster upgrade may take a long time—approximately the time it takes to upgrade one machine multiplied by the number of cluster nodes.

  • A single node cannot be upgraded while running the previous SUSE Linux Enterprise Server release, but needs to be rebooted into the new version's installer. Therefore the services that the node provides will be unavailable for some time. The core cluster services will still be available—for example if one MON is down during upgrade, there are still at least two active MONs. Unfortunately, single instance services, such as a single iSCSI Gateway, will be unavailable.

  • Certain types of daemons depend upon others. For example Ceph Object Gateways depend upon Ceph MON and OSD daemons. We recommend upgrading in this order:

    1. Admin Node

    2. Ceph Monitors/Ceph Managers

    3. Metadata Servers

    4. Ceph OSDs

    5. Object Gateways

    6. iSCSI Gateways

    7. NFS Ganesha

    8. Samba Gateways

  • If you used AppArmor in either 'complain' or 'enforce' mode, you need to set a Salt pillar variable before upgrading. Because SUSE Linux Enterprise Server 15 SP1 ships with AppArmor by default, AppArmor management was integrated into DeepSea stage.0. The default behavior in SUSE Enterprise Storage 6 is to remove AppArmor and related profiles. If you want to retain the behavior configured in SUSE Enterprise Storage 5.5, verify that one of the following lines is present in the /srv/pillar/ceph/stack/global.yml file before starting the upgrade:

    apparmor_init: default-enforce

    or

    apparmor_init: default-complain
  • Since SUSE Enterprise Storage 6, MDS names starting with a digit are no longer allowed and MDS daemons will refuse to start. You can check whether your daemons have such names either by running the ceph fs status command, or by restarting an MDS and checking its logs for the following message:

    deprecation warning: MDS id 'mds.1mon1' is invalid and will be forbidden in
    a future version.  MDS names may not start with a numeric digit.

    If you see the above message, the MDS names will need to be migrated before attempting to upgrade to SUSE Enterprise Storage 6. DeepSea provides an orchestration to automate such migration. MDS names starting with a digit will be prepended with 'mds.':

    root@master # salt-run state.orch ceph.mds.migrate-numerical-names
    Tip
    Tip: Custom Configuration Bound to MDS Names

    If you have configuration settings that are bound to MDS names and your MDS daemons have names starting with a digit, verify that your configuration settings apply to the new names as well (with the 'mds.' prefix). Consider the following example section in the /etc/ceph/ceph.conf file:

    [mds.123-my-mds] # config settings specific to the MDS whose name starts with a digit
     mds cache memory limit = 1073741824
     mds standby for name = 456-another-mds

    The ceph.mds.migrate-numerical-names orchestrator will change the MDS daemon name '123-my-mds' to 'mds.123-my-mds'. You need to adjust the configuration to reflect the new name:

    [mds.mds.123-my-mds] # config settings specific to the same MDS under its new name
    mds cache memory limit = 1073741824
    mds standby for name = mds.456-another-mds

    This will add MDS daemons with the new names before removing the old MDS daemons. The number of MDS daemons will double for a short time. Clients will be able to access CephFS with only a short pause for failover. Therefore plan the migration for times when you expect little or no CephFS load.

5.2 Backup Cluster Data

Although creating backups of the cluster's configuration and data is not mandatory, we strongly recommend backing up important configuration files and cluster data. Refer to Book “Administration Guide”, Chapter 2 “Backing Up Cluster Configuration and Data” for more details.

5.3 Migrate from ntpd to chronyd

SUSE Linux Enterprise Server 15 SP1 no longer uses ntpd to synchronize the local host time. Instead, chronyd is used. You need to migrate the time synchronization daemon on each cluster node. You can migrate to chronyd either before upgrading the cluster, or upgrade the cluster first and migrate to chronyd afterward.

Procedure 5.1: Migrate to chronyd before the Cluster Upgrade
  1. Install the chrony package:

    root@minion > zypper install chrony
  2. Edit the chronyd configuration file /etc/chrony.conf and add NTP sources from the current ntpd configuration in /etc/ntp.conf.
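
    For example, if /etc/ntp.conf lists NTP servers, the corresponding /etc/chrony.conf entries can look as follows (the host names below are placeholders):

    server ntp1.example.com iburst
    server ntp2.example.com iburst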

  3. Disable and stop the ntpd service:

    root@master # systemctl disable ntpd.service && systemctl stop ntpd.service
  4. Start and enable the chronyd service:

    root@master # systemctl start chronyd.service && systemctl enable chronyd.service
  5. Verify the status of chrony:

    root@master # chronyc tracking
Procedure 5.2: Migrate to chronyd after the Cluster Upgrade
  1. During cluster upgrade, add the following software repositories:

    • SLE-Module-Legacy15-SP1-Pool

    • SLE-Module-Legacy15-SP1-Updates

  2. Upgrade the cluster to version 6.

  3. Edit the chronyd configuration file /etc/chrony.conf and add NTP sources from the current ntpd configuration in /etc/ntp.conf.

    Tip
    Tip: More Details on chronyd Configuration

    Refer to https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html to find more details about how to include time sources in chronyd configuration.

  4. Disable and stop the ntpd service:

    root@master # systemctl disable ntpd.service && systemctl stop ntpd.service
  5. Start and enable the chronyd service:

    root@master # systemctl start chronyd.service && systemctl enable chronyd.service
  6. Migrate from ntpd to chronyd.

  7. Verify the status of chrony:

    root@master # chronyc tracking
  8. Remove the legacy software repositories that you added to keep ntpd in the system during the upgrade process.

5.4 Patch Cluster Prior to Upgrade

Apply the latest patches to all cluster nodes prior to upgrade.

5.4.1 Required Software Repositories

Check that the required repositories are configured on each cluster node. To list all available repositories, run

root@minion > zypper lr

SUSE Enterprise Storage 5.5 requires:

  • SLES12-SP3-Installer-Updates

  • SLES12-SP3-Pool

  • SLES12-SP3-Updates

  • SUSE-Enterprise-Storage-5-Pool

  • SUSE-Enterprise-Storage-5-Updates

NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 12 SP3 requires:

  • SLE-HA12-SP3-Pool

  • SLE-HA12-SP3-Updates

5.4.2 Repository Staging Systems

If you are using one of the repository staging systems (SMT, RMT, or SUSE Manager), create a new frozen patch level for the current and the new SUSE Enterprise Storage version.

Find more information in the documentation of your repository staging system.

5.4.3 Patch the Whole Cluster to the Latest Patches

  1. Apply the latest patches of SUSE Enterprise Storage 5.5 and SUSE Linux Enterprise Server 12 SP3 to each Ceph cluster node. Verify that the correct software repositories are connected to each cluster node (see Section 5.4.1, “Required Software Repositories”) and run DeepSea stage.0:

    root@master # salt-run state.orch ceph.stage.0
  2. After stage.0 completes, verify that each cluster node's status includes 'HEALTH_OK'. If not, resolve the problem before the possible reboots in the next steps.

  3. Run zypper ps to check for processes that may run with outdated libraries or binaries, and reboot the node if there are any.

  4. Verify that the running kernel is the latest available and reboot if not. Check outputs of the following commands:

    cephadm > uname -a
    cephadm > rpm -qa kernel-default
  5. Verify that the ceph package is version 12.2.12 or newer. Verify that the deepsea package is version 0.8.9 or newer.
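
    A quick way to check the installed versions is, for example:

    cephadm > rpm -q ceph
    root@master # rpm -q deepsea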

  6. If you previously used any of the bluestore_cache settings, they are not effective any more since ceph version 12.2.10. The new setting bluestore_cache_autotune that is set to 'true' by default disables manual cache sizing. To turn on the old behavior, you need to set bluestore_cache_autotune=false. Refer to Book “Administration Guide”, Chapter 14 “Ceph Cluster Configuration”, Section 14.2.1 “Automatic Cache Sizing” for details.

5.5 Verify the Current Environment

  • If the system has obvious problems, fix them before starting the upgrade. Upgrade never fixes existing system problems.

  • Check cluster performance. You can use commands such as rados bench, ceph tell osd.* bench, or iperf3.

  • Verify access to gateways (such as iSCSI Gateway, or Object Gateway) and RADOS Block Device.

  • Document specific parts of the system setup, such as network setup, partitioning, or installation details.

  • Use supportconfig to collect important system information and save it outside cluster nodes. Find more information in https://www.suse.com/documentation/sles-12/book_sle_admin/data/sec_admsupport_supportconfig.html.

  • Ensure there is enough free disk space on each cluster node. Check free disk space with df -h. When needed, free disk space by removing unneeded files or directories, or by removing obsolete OS snapshots. If there is not enough free disk space, do not continue with the upgrade until you have freed enough space.

5.6 Check the Cluster's State

  • Check the cluster health before starting the upgrade procedure. Do not start the upgrade unless each cluster node reports 'HEALTH_OK'.

  • Verify that all services are running:

    • Salt master and Salt minion daemons.

    • Ceph Monitor and Ceph Manager daemons.

    • Metadata Server daemons.

    • Ceph OSD daemons.

    • Object Gateway daemons.

    • iSCSI Gateway daemons.

The following commands provide details about the cluster state and specific configuration:

ceph -s

Prints a brief summary of Ceph cluster health, running services, data usage, and IO statistics. Verify that it reports 'HEALTH_OK' before starting the upgrade.

ceph health detail

Prints details if Ceph cluster health is not OK.

ceph versions

Prints versions of running Ceph daemons.

ceph df

Prints total and free disk space on the cluster. Do not start the upgrade if the cluster's free disk space is less than 25% of the total disk space.

salt '*' cephprocesses.check results=true

Prints running Ceph processes and their PIDs, sorted by Salt minion.

ceph osd dump | grep ^flags

Verify that 'recovery_deletes' and 'purged_snapdirs' flags are present. If not, you can force a scrub on all placement groups by running the following command. Be aware that this forced scrub may possibly have a negative impact on your Ceph clients’ performance.

cephadm > ceph pg dump pgs_brief | cut -d " " -f 1 | xargs -n1 ceph pg scrub

5.7 Offline Upgrade of CTDB Clusters

CTDB provides a clustered database used by Samba Gateways. The CTDB protocol is very simple and does not support clusters of nodes communicating with different protocol versions. Therefore CTDB nodes need to be taken offline prior to performing an upgrade.
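
How exactly you take CTDB offline depends on your Samba Gateway setup. As a rough sketch, assuming CTDB runs as the ctdb systemd service on each gateway node, you can stop it with:

root # systemctl stop ctdb.service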

5.8 Per Node Upgrade—Basic Procedure

To ensure the core cluster services are available during the upgrade, you need to upgrade the cluster nodes sequentially one by one.

Tip
Tip: Orphaned Packages

After a node is upgraded, a number of packages will be in an 'orphaned' state without a parent repository. This happens because python3 related packages do not obsolete python2 packages.

Find more information about listing orphaned packages in https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_zypper.html#sec_zypper_softup_orphaned.
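
For example, one way to list them after the upgrade is:

root@minion > zypper packages --orphaned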

5.8.1 Manual Node Upgrade using the Installer DVD

  1. Reboot the node from the SUSE Linux Enterprise Server 15 SP1 installer DVD/image.

  2. Select Upgrade from the boot menu.

  3. On the Select the Migration Target screen, verify that 'SUSE Linux Enterprise Server 15 SP1' is selected and activate the Manually Adjust the Repositories for Migration check box.

    Figure 5.1: Select the Migration Target
  4. Select the following modules to install:

    • SUSE Enterprise Storage 6 x86_64

    • Basesystem Module 15 SP1 x86_64

    • Desktop Applications Module 15 SP1 x86_64

    • Legacy Module 15 SP1 x86_64

    • Server Applications Module 15 SP1 x86_64

  5. On the Previously Used Repositories screen, verify that the correct repositories are selected. If the system is not registered with SCC/SMT, you need to add the repositories manually.

    SUSE Enterprise Storage 6 requires:

    • SLE-Module-Basesystem15-SP1-Pool

    •  SLE-Module-Basesystem15-SP1-Updates

    •  SLE-Module-Server-Applications15-SP1-Pool

    •  SLE-Module-Server-Applications15-SP1-Updates

    • SLE-Module-Desktop-Applications15-SP1-Pool

    • SLE-Module-Desktop-Applications15-SP1-Updates

    •  SLE-Product-SLES15-SP1-Pool

    •  SLE-Product-SLES15-SP1-Updates

    •  SLE15-SP1-Installer-Updates

    •  SUSE-Enterprise-Storage-6-Pool

    •  SUSE-Enterprise-Storage-6-Updates

    If you intend to migrate ntpd to chronyd after the SES migration (refer to Section 5.3, “Migrate from ntpd to chronyd”), include the following repositories:

    • SLE-Module-Legacy15-SP1-Pool

    • SLE-Module-Legacy15-SP1-Updates

    NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 15 SP1 requires:

    • SLE-Product-HA15-SP1-Pool

    • SLE-Product-HA15-SP1-Updates

  6. Review the Installation Settings and start the installation procedure by clicking Update.

5.9 Upgrade the Admin Node

5.10 Upgrade Ceph Monitor/Ceph Manager Nodes

  • If your cluster does not use MDS roles, upgrade MON/MGR nodes one by one.

  • If your cluster uses MDS roles, and MON/MGR and MDS roles are co-located, you need to shrink the MDS cluster and then upgrade the co-located nodes. Refer to Section 5.11, “Upgrade Metadata Servers” for more details.

  • If your cluster uses MDS roles and they run on dedicated servers, upgrade all MON/MGR nodes one by one, then shrink the MDS cluster and upgrade it. Refer to Section 5.11, “Upgrade Metadata Servers” for more details.

Note
Note: Ceph Monitor Upgrade

Due to a limitation in the Ceph Monitor design, once two MONs have been upgraded to SUSE Enterprise Storage 6 and have formed a quorum, the third MON (while still on SUSE Enterprise Storage 5.5) will not rejoin the MON cluster if it is restarted for any reason, including a node reboot. Therefore, when two MONs have been upgraded, it is best to upgrade the rest as soon as possible.

Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

5.11 Upgrade Metadata Servers

You need to shrink the Metadata Server (MDS) cluster. Because of incompatible features between SUSE Enterprise Storage 5.5 and 6, the older MDS daemons will shut down as soon as they see a single SES 6 level MDS join the cluster. Therefore it is necessary to shrink the MDS cluster to a single active MDS (and no standbys) for the duration of the MDS node upgrades. As soon as the second node is upgraded, you can extend the MDS cluster again.

Tip

On a heavily loaded MDS cluster, you may need to reduce the load (for example by stopping clients) so that a single active MDS is able to handle the workload.

  1. Note the current value of the max_mds option:

    cephadm > ceph fs get cephfs | grep max_mds
  2. Shrink the MDS cluster if you have more than one active MDS daemon, that is, if max_mds is > 1. To shrink the MDS cluster, run

    cephadm > ceph fs set FS_NAME max_mds 1

    where FS_NAME is the name of your CephFS instance ('cephfs' by default).

  3. Find the node hosting one of the standby MDS daemons. Consult the output of the ceph fs status command and start the upgrade of the MDS cluster on this node.

    cephadm > ceph fs status
    cephfs - 2 clients
    ======
    +------+--------+--------+---------------+-------+-------+
    | Rank | State  |  MDS   |    Activity   |  dns  |  inos |
    +------+--------+--------+---------------+-------+-------+
    |  0   | active | mon1-6 | Reqs:    0 /s |   13  |   16  |
    +------+--------+--------+---------------+-------+-------+
    +-----------------+----------+-------+-------+
    |       Pool      |   type   |  used | avail |
    +-----------------+----------+-------+-------+
    | cephfs_metadata | metadata | 2688k | 96.8G |
    |   cephfs_data   |   data   |    0  | 96.8G |
    +-----------------+----------+-------+-------+
    +-------------+
    | Standby MDS |
    +-------------+
    |    mon3-6   |
    |    mon2-6   |
    +-------------+

    In this example, you need to start the upgrade procedure either on node 'mon3-6' or 'mon2-6'.

  4. Upgrade the node with the standby MDS daemon. After the upgraded MDS node starts, the outdated MDS daemons will shut down automatically. At this point, clients may experience a short downtime of the CephFS service.

    Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

  5. Upgrade the remaining MDS nodes.

  6. Reset max_mds to the desired configuration:

    root@master # ceph fs set FS_NAME max_mds ACTIVE_MDS_COUNT

5.12 Upgrade Ceph OSDs

For each storage node, follow these steps:

  1. Identify which OSD daemons are running on a particular node:

    cephadm > ceph osd tree
  2. Set the 'noout' flag for each OSD daemon on the node that is being upgraded:

    cephadm > ceph osd add-noout osd.OSD_ID

    For example:

    cephadm > for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd add-noout osd.$i; done

    Verify with:

    cephadm > ceph health detail | grep noout

    or

    cephadm > ceph -s
    cluster:
     id:     44442296-033b-3275-a803-345337dc53da
     health: HEALTH_WARN
          6 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
  3. Create /etc/ceph/osd/*.json files for all existing OSDs by running the following command on the node that is going to be upgraded:

    cephadm > ceph-volume simple scan --force
  4. Upgrade the OSD node. Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

  5. Activate all OSDs found in the system:

    cephadm > ceph-volume simple activate --all
    Tip
    Tip: Activating Data Partitions Individually

    If you want to activate data partitions individually, you need to find the correct ceph-volume command for each partition to activate it. Replace X1 with the partition's correct letter/number:

     cephadm > ceph-volume simple scan /dev/sdX1

    For example:

    cephadm > ceph-volume simple scan /dev/vdb1
    [...]
    --> OSD 8 got scanned and metadata persisted to file:
    /etc/ceph/osd/8-d7bd2685-5b92-4074-8161-30d146cd0290.json
    --> To take over management of this scanned OSD, and disable ceph-disk
    and udev, run:
    -->     ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290

    The last line of the output contains the command to activate the partition:

    cephadm > ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290
    [...]
    --> All ceph-disk systemd units have been disabled to prevent OSDs
    getting triggered by UDEV events
    [...]
    Running command: /bin/systemctl start ceph-osd@8
    --> Successfully activated OSD 8 with FSID
    d7bd2685-5b92-4074-8161-30d146cd0290
  6. Verify that the OSD node will start properly after the reboot.

  7. Address the 'Legacy BlueStore stats reporting detected on XX OSD(s)' message:

    cephadm > ceph -s
    cluster:
     id:     44442296-033b-3275-a803-345337dc53da
     health: HEALTH_WARN
     Legacy BlueStore stats reporting detected on 6 OSD(s)

    The warning is normal when upgrading Ceph to 14.2.2. You can disable it by setting:

    bluestore_warn_on_legacy_statfs = false

    The proper fix is to run the following command on all OSDs while they are stopped:

    cephadm > ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-XXX

    Following is a helper script that runs the ceph-bluestore-tool repair for all OSDs on the NODE_NAME node:

    OSDNODE=OSD_NODE_NAME;\
     for OSD in $(ceph osd ls-tree $OSDNODE);\
     do echo "osd=" $OSD;\
     salt $OSDNODE cmd.run "systemctl stop ceph-osd@$OSD";\
     salt $OSDNODE cmd.run "ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$OSD";\
     salt $OSDNODE cmd.run "systemctl start ceph-osd@$OSD";\
     done
  8. Unset the 'noout' flag for each OSD daemon on the node that is upgraded:

    cephadm > ceph osd rm-noout osd.OSD_ID

    For example:

    cephadm > for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd rm-noout osd.$i; done

    Verify with:

    cephadm > ceph health detail | grep noout

    Note:

    cephadm > ceph -s
    cluster:
     id:     44442296-033b-3275-a803-345337dc53da
     health: HEALTH_WARN
     Legacy BlueStore stats reporting detected on 6 OSD(s)
  9. Verify the cluster status. It will be similar to the following output:

    cephadm > ceph status
    cluster:
      id:     e0d53d64-6812-3dfe-8b72-fd454a6dcf12
      health: HEALTH_WARN
              3 monitors have not enabled msgr2
    
    services:
      mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
      mgr: mon2(active, since 22m), standbys: mon1, mon3
      osd: 30 osds: 30 up, 30 in
    
    data:
      pools:   1 pools, 1024 pgs
      objects: 0 objects, 0 B
      usage:   31 GiB used, 566 GiB / 597 GiB avail
      pgs:     1024 active+clean
    Tip
    Tip: Check for the Version of Cluster Components/Nodes

    When you need to find out the versions of individual cluster components and nodes—for example to find out if all your nodes are actually on the same patch level after the upgrade—you can run

    root@master # salt-run status.report

    The command goes through the connected Salt minions and scans for the version numbers of Ceph, Salt, and SUSE Linux Enterprise Server, and gives you a report displaying the version that the majority of nodes have and showing nodes whose version is different from the majority.

  10. Verify that all OSD nodes were rebooted and that OSDs started automatically after the reboot.

5.13 OSD Migration to BlueStore

OSD BlueStore is a new back end for the OSD daemons. It is the default option since SUSE Enterprise Storage 5. Compared to FileStore, which stores objects as files in an XFS file system, BlueStore can deliver increased performance because it stores objects directly on the underlying block device. BlueStore also enables other features, such as built-in compression and EC overwrites, that are unavailable with FileStore.

Specifically for BlueStore, an OSD has a 'wal' (Write Ahead Log) device and a 'db' (RocksDB database) device. The RocksDB database holds the metadata for a BlueStore OSD. These two devices will reside on the same device as an OSD by default, but either can be placed on different, for example faster, media.

In SUSE Enterprise Storage 5, both FileStore and BlueStore are supported and it is possible for FileStore and BlueStore OSDs to co-exist in a single cluster. During the SUSE Enterprise Storage upgrade procedure, FileStore OSDs are not automatically converted to BlueStore. Be aware that the BlueStore-specific features will not be available on OSDs that have not been migrated to BlueStore.

Before converting to BlueStore, the OSDs need to be running SUSE Enterprise Storage 5. The conversion is a slow process as all data gets re-written twice. Though the migration process can take a long time to complete, there is no cluster outage and all clients can continue accessing the cluster during this period. However, do expect lower performance for the duration of the migration. This is caused by rebalancing and backfilling of cluster data.

Use the following procedure to migrate FileStore OSDs to BlueStore:

Tip
Tip: Turn Off Safety Measures

Salt commands needed for running the migration are blocked by safety measures. In order to turn these precautions off, run the following command:

 root@master # salt-run disengage.safety

Rebuild the nodes before continuing:

 root@master #  salt-run rebuild.node TARGET

You can also choose to rebuild each node individually. For example:

root@master #  salt-run rebuild.node data1.ceph

The rebuild.node runner always removes and recreates all OSDs on the node.

Important

If one OSD fails to convert, re-running the rebuild destroys the already converted BlueStore OSDs. Instead of re-running the rebuild, you can run:

root@master # salt-run disks.deploy TARGET

After the migration to BlueStore, the object count will remain the same and disk usage will be nearly the same.

5.14 Upgrade Application Nodes

Upgrade application nodes in the following order:

  1. Object Gateways

    • If the Object Gateways are fronted by a load balancer, then a rolling upgrade of the Object Gateways should be possible without an outage.

    • Validate that the Object Gateway daemons are running after each upgrade, and test with S3/Swift client.

    • Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

  2. iSCSI Gateways

    • If iSCSI initiators are configured with multipath, then a rolling upgrade of the iSCSI Gateways should be possible without an outage.

    • Validate that the lrbd daemon is running after each upgrade, and test with initiator.

    • Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

  3. NFS Ganesha. Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

  4. Samba Gateways. Use the procedure described in Section 5.8, “Per Node Upgrade—Basic Procedure”.

5.15 Update policy.cfg and Deploy Ceph Dashboard using DeepSea

On the Admin Node, edit /srv/pillar/ceph/proposals/policy.cfg and apply the following changes:

Important
Important: No New Services

During cluster upgrade, do not add new services to the policy.cfg file. Change the cluster architecture only after the upgrade is completed.

  1. Remove role-openattic.

  2. Add role-prometheus and role-grafana to the node that had Prometheus and Grafana installed, usually the Admin Node.

  3. The profile-PROFILE_NAME role is now ignored. Add a new corresponding role-storage line instead. For example, for an existing

    profile-default/cluster/*.sls

    add

    role-storage/cluster/*.sls
  4. Synchronize all Salt modules:

    root@master # salt '*' saltutil.sync_all
  5. Update the Salt pillar by running DeepSea stage.1 and stage.2:

    root@master # salt-run state.orch ceph.stage.1
    root@master # salt-run state.orch ceph.stage.2
  6. Clean up openATTIC:

    root@master # salt OA_MINION state.apply ceph.rescind.openattic
    root@master # salt OA_MINION state.apply ceph.remove.openattic
  7. Unset the 'restart_igw' grain to prevent stage.0 from restarting the iSCSI Gateway, which is not installed yet:

    root@master # salt '*' grains.delkey restart_igw
  8. Finally, run through DeepSea stages 0-4:

    root@master # salt-run state.orch ceph.stage.0
    root@master # salt-run state.orch ceph.stage.1
    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3
    root@master # salt-run state.orch ceph.stage.4
    Tip
    Tip: 'subvolume missing' Errors during stage.3

    DeepSea stage.3 may fail with an error similar to the following:

    subvolume : ['/var/lib/ceph subvolume missing on 4510-2', \
    '/var/lib/ceph subvolume missing on 4510-1', \
    [...]
    'See /srv/salt/ceph/subvolume/README.md']

    In this case, you need to edit /srv/pillar/ceph/stack/global.yml and add the following line:

    subvolume_init: disabled

    Then refresh the Salt pillar and re-run DeepSea stage.3:

    root@master # salt '*' saltutil.refresh_pillar
     root@master # salt-run state.orch ceph.stage.3

    After DeepSea has successfully finished stage.3, the Ceph Dashboard will be running. Refer to Book “Administration Guide”, Chapter 20 “Ceph Dashboard” for a detailed overview of Ceph Dashboard features.

    To list the node running the Ceph Dashboard, run:

    cephadm > ceph mgr services | grep dashboard

    To list the admin credentials, run:

    root@master # salt-call grains.get dashboard_creds
  9. Sequentially restart the Object Gateway services so that they use the 'beast' Web server instead of the outdated 'civetweb':

    root@master # salt-run state.orch ceph.restart.rgw.force
  10. Before you continue, we strongly recommend enabling the Ceph telemetry module. See Book “Administration Guide”, Chapter 8 “Ceph Manager Modules”, Section 8.2 “Telemetry Module” for information and instructions.

5.16 Migration from Profile-based Deployments to DriveGroups Edit source

In SUSE Enterprise Storage 5.5, DeepSea offered so-called 'profiles' to describe the layout of your OSDs. Starting with SUSE Enterprise Storage 6, we moved to a different approach called DriveGroups (find more details in Section 4.5.2, “DriveGroups”).

Note
Note

Migrating to the new approach is not immediately mandatory. Destructive operations, such as salt-run osd.remove, salt-run osd.replace, or salt-run osd.purge are still available. However, adding new OSDs will require your action.

Because of the different approach of these implementations, we do not offer an automated migration path. However, we offer a variety of tools—Salt runners—to make the migration as simple as possible.

5.16.1 Analyze the Current Layout Edit source

To view information about the currently deployed OSDs, use the following command:

root@master # salt-run disks.discover

Alternatively, you can inspect the content of the files in the /srv/pillar/ceph/proposals/profile-*/ directories. They have a structure similar to the following:

ceph:
  storage:
    osds:
      /dev/disk/by-id/scsi-drive_name:
        format: bluestore
      /dev/disk/by-id/scsi-drive_name2:
        format: bluestore

5.16.2 Create DriveGroups Matching the Current Layout Edit source

Refer to Section 4.5.2.1, “Specification” for more details on DriveGroups specification.
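
As an illustration only, the following minimal DriveGroups specification deploys standalone BlueStore OSDs on all available data devices of all storage nodes. It assumes the DriveGroups file /srv/salt/ceph/configuration/files/drive_groups.yml described in Section 4.5.2; the group name is arbitrary and the filters need to be adapted to reproduce your existing layout:

drive_group_default:
  target: '*'
  data_devices:
    all: true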

The difference between a fresh deployment and an upgrade scenario is that the drives to be migrated are already 'used'. Because

root@master # salt-run disks.list

looks for unused disks only, use

root@master # salt-run disks.list include_unavailable=True

Adjust DriveGroups until you match your current setup. For a more visual representation of what will be happening, use the following command. Note that it has no output if there are no free disks:

root@master # salt-run disks.report bypass_pillar=True

If you verified that your DriveGroups are properly configured and want to apply the new approach, remove the files from the /srv/pillar/ceph/proposals/profile-PROFILE_NAME/ directory, remove the corresponding profile-PROFILE_NAME/cluster/*.sls lines from the /srv/pillar/ceph/proposals/policy.cfg file, and run DeepSea stage.2 to refresh the Salt pillar:

root@master # salt-run state.orch ceph.stage.2

Verify the result by running the following commands:

root@master # salt target_node pillar.get ceph:storage
root@master # salt-run disks.report
Warning
Warning: Incorrect DriveGroups Configuration

If your DriveGroups are not properly configured and there are spare disks in your setup, the spare disks will be deployed in the way you specified. To check what will be deployed beforehand, we recommend running:

root@master # salt-run disks.report

5.16.3 OSD Deployment Edit source

For simple cases, such as standalone OSDs, the migration will happen over time. Whenever you remove or replace an OSD in the cluster, it will be replaced by a new, LVM-based OSD.

Tip
Tip: Migrate to LVM Format

Whenever a single 'legacy' OSD needs to be replaced on a node, all OSDs that share devices with it need to be migrated to the LVM-based format.

For completeness, consider migrating OSDs on the whole node.
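
The following is a hedged sketch of such a node-wide migration using the runners mentioned in this chapter. Replace OSD_ID with each OSD sharing the device and TARGET with the affected node, and verify the planned layout with disks.report before redeploying:

root@master # salt-run osd.remove OSD_ID
root@master # salt-run disks.report
root@master # salt-run disks.deploy TARGET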

5.16.4 More Complex Setups Edit source

If you have a more sophisticated setup than just standalone OSDs, for example dedicated WAL/DB devices or encrypted OSDs, the migration can only happen when all OSDs assigned to that WAL/DB device are removed. This is because ceph-volume creates Logical Volumes on the disks before deployment, which prevents the user from mixing partition-based deployments with LV-based deployments. In such cases, it is best to manually remove all OSDs that are assigned to a WAL/DB device and re-deploy them using the DriveGroups approach.

6 Customizing the Default Configuration Edit source

You can change the default cluster configuration generated in Stage 2 (refer to DeepSea Stages Description). For example, you may need to change network settings, or software that is installed on the Admin Node by default. You can perform the former by modifying the pillar updated after Stage 2, while the latter is usually done by creating a custom sls file and adding it to the pillar. Details are described in the following sections.

6.1 Using Customized Configuration Files Edit source

This section lists several tasks that require adding/changing your own sls files. Such a procedure is typically used when you need to change the default deployment process.

Tip
Tip: Prefix Custom .sls Files

Your custom .sls files belong to the same subdirectory as DeepSea's .sls files. To prevent overwriting your .sls files with the possibly newly added ones from the DeepSea package, prefix their name with the custom- string.

6.1.1 Disabling a Deployment Step Edit source

If you address a specific task outside of the DeepSea deployment process and therefore need to skip it, create a 'no-operation' file following this example:

Procedure 6.1: Disabling Time Synchronization
  1. Create /srv/salt/ceph/time/disabled.sls with the following content and save it:

    disable time setting:
      test.nop
  2. Edit /srv/pillar/ceph/stack/global.yml, add the following line, and save it:

    time_init: disabled
  3. Verify by refreshing the pillar and running the step:

    root@master # salt target saltutil.refresh_pillar
    root@master # salt 'admin.ceph' state.apply ceph.time
    admin.ceph:
      Name: disable time setting - Function: test.nop - Result: Clean
    
    Summary for admin.ceph
    ------------
    Succeeded: 1
    Failed:    0
    ------------
    Total states run:     1
    Note
    Note: Unique ID

    The task ID 'disable time setting' may be any message unique within an sls file. Prevent ID collisions by specifying unique descriptions.

6.1.2 Replacing a Deployment Step Edit source

If you need to replace the default behavior of a specific step with a custom one, create a custom sls file with replacement content.

By default /srv/salt/ceph/pool/default.sls creates an rbd image called 'demo'. In our example, we do not want this image to be created, but we need two images: 'archive1' and 'archive2'.

Procedure 6.2: Replacing the demo rbd Image with Two Custom rbd Images
  1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

    wait:
      module.run:
        - name: wait.out
        - kwargs:
            'status': "HEALTH_ERR"1
        - fire_event: True
    
    archive1:
      cmd.run:
        - name: "rbd -p rbd create archive1 --size=1024"2
        - unless: "rbd -p rbd ls | grep -q archive1$"
        - fire_event: True
    
    archive2:
      cmd.run:
        - name: "rbd -p rbd create archive2 --size=768"
        - unless: "rbd -p rbd ls | grep -q archive2$"
        - fire_event: True

    1

    The wait module will pause until the Ceph cluster does not have a status of HEALTH_ERR. In fresh installations, a Ceph cluster may have this status until a sufficient number of OSDs become available and the creation of pools has completed.

    2

    The rbd command is not idempotent. If the same creation command is re-run after the image exists, the Salt state will fail. The unless statement prevents this.

  2. To call the newly created custom file instead of the default, you need to edit /srv/pillar/ceph/stack/ceph/cluster.yml, add the following line, and save it:

    pool_init: custom
  3. Verify by refreshing the pillar and running the step:

    root@master # salt target saltutil.pillar_refresh
    root@master # salt 'admin.ceph' state.apply ceph.pool
Note
Note: Authorization

The creation of pools or images requires sufficient authorization. The admin.ceph minion has an admin keyring.

Tip
Tip: Alternative Way

Another option is to change the variable in /srv/pillar/ceph/stack/ceph/roles/master.yml instead. Using this file will reduce the clutter of pillar data for other minions.
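
For example, assuming the same pool customization as above, /srv/pillar/ceph/stack/ceph/roles/master.yml would then contain the line:

pool_init: custom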

6.1.3 Modifying a Deployment Step Edit source

Sometimes you may need a specific step to do some additional tasks. We do not recommend modifying the related state file, as it may complicate a future upgrade. Instead, create a separate file that carries out the additional tasks, identical to what was described in Section 6.1.2, “Replacing a Deployment Step”.

Name the new sls file descriptively. For example, if you need to create two rbd images in addition to the demo image, name the file archive.sls.

Procedure 6.3: Creating Two Additional rbd Images
  1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

    include:
     - .archive
     - .default
    Tip
    Tip: Include Precedence

    In this example, Salt will create the archive images and then create the demo image. The order does not matter in this example. To change the order, reverse the lines after the include: directive.

    You can add the include line directly to archive.sls and all the images will get created as well. However, regardless of where the include line is placed, Salt processes the steps in the included file first. Although this behavior can be overridden with requires and order statements, a separate file that includes the others guarantees the order and reduces the chances of confusion.

  2. Edit /srv/pillar/ceph/stack/ceph/cluster.yml, add the following line, and save it:

    pool_init: custom
  3. Verify by refreshing the pillar and running the step:

    root@master # salt target saltutil.pillar_refresh
    root@master # salt 'admin.ceph' state.apply ceph.pool

6.1.4 Modifying a Deployment Stage Edit source

If you need to add a completely separate deployment step, create three new files—an sls file that performs the command, an orchestration file, and a custom file which aligns the new step with the original deployment steps.

For example, if you need to run logrotate on all minions as part of the preparation stage:

First create an sls file and include the logrotate command.

Procedure 6.4: Running logrotate on all Salt minions
  1. Create a directory such as /srv/salt/ceph/logrotate.

  2. Create /srv/salt/ceph/logrotate/init.sls with the following content and save it:

    rotate logs:
      cmd.run:
        - name: "/usr/sbin/logrotate /etc/logrotate.conf"
  3. Verify that the command works on a minion:

    root@master # salt 'admin.ceph' state.apply ceph.logrotate

Because the orchestration file needs to run before all other preparation steps, add it to the Prep stage 0:

  1. Create /srv/salt/ceph/stage/prep/logrotate.sls with the following content and save it:

    logrotate:
      salt.state:
        - tgt: '*'
        - sls: ceph.logrotate
  2. Verify that the orchestration file works:

    root@master # salt-run state.orch ceph.stage.prep.logrotate

The last file is the custom one which includes the additional step with the original steps:

  1. Create /srv/salt/ceph/stage/prep/custom.sls with the following content and save it:

    include:
      - .logrotate
      - .master
      - .minion
  2. Override the default behavior. Edit /srv/pillar/ceph/stack/global.yml, add the following line, and save the file:

    stage_prep: custom
  3. Verify that Stage 0 works:

    root@master # salt-run state.orch ceph.stage.0
Note
Note: Why global.yml?

The global.yml file is chosen over the cluster.yml file because during the prep stage, no minion belongs to the Ceph cluster yet, and therefore none has access to any settings in cluster.yml.

6.1.5 Updates and Reboots during Stage 0 Edit source

During Stage 0 (refer to DeepSea Stages Description for more information on DeepSea stages), the Salt master and Salt minions may optionally reboot if newly updated packages, for example the kernel, require rebooting the system.

The default behavior is to install available new updates and not reboot the nodes even in case of kernel updates.

You can change the default update/reboot behavior of DeepSea Stage 0 by adding/changing the stage_prep_master and stage_prep_minion options in the /srv/pillar/ceph/stack/global.yml file. stage_prep_master sets the behavior of the Salt master, and stage_prep_minion sets the behavior of all minions. All available parameters are:

default

Install updates without rebooting.

default-update-reboot

Install updates and reboot after updating.

default-no-update-reboot

Reboot without installing updates.

default-no-update-no-reboot

Do not install updates or reboot.

For example, to prevent the cluster nodes from installing updates and rebooting, edit /srv/pillar/ceph/stack/global.yml and add the following lines:

stage_prep_master: default-no-update-no-reboot
stage_prep_minion: default-no-update-no-reboot
Tip
Tip: Values and Corresponding Files

The values of stage_prep_master correspond to file names located in /srv/salt/ceph/stage/0/master, while values of stage_prep_minion correspond to files in /srv/salt/ceph/stage/0/minion:

cephadm > ls -l /srv/salt/ceph/stage/0/master
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]

cephadm > ls -l /srv/salt/ceph/stage/0/minion
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]

6.2 Modifying Discovered Configuration Edit source

After you have completed Stage 2, you may want to change the discovered configuration. To view the current settings, run:

root@master # salt target pillar.items

The output of the default configuration for a single minion is usually similar to the following:

----------
    available_roles:
        - admin
        - mon
        - storage
        - mds
        - igw
        - rgw
        - client-cephfs
        - client-radosgw
        - client-iscsi
        - mds-nfs
        - rgw-nfs
        - master
    cluster:
        ceph
    cluster_network:
        172.16.22.0/24
    fsid:
        e08ec63c-8268-3f04-bcdb-614921e94342
    master_minion:
        admin.ceph
    mon_host:
        - 172.16.21.13
        - 172.16.21.11
        - 172.16.21.12
    mon_initial_members:
        - mon3
        - mon1
        - mon2
    public_address:
        172.16.21.11
    public_network:
        172.16.21.0/24
    roles:
        - admin
        - mon
        - mds
    time_server:
        admin.ceph
    time_service:
        ntp

The settings mentioned above are distributed across several configuration files. The directory structure with these files is defined in the /srv/pillar/ceph/stack/stack.cfg file. The following files usually describe your cluster:

  • /srv/pillar/ceph/stack/global.yml - the file affects all minions in the Salt cluster.

  • /srv/pillar/ceph/stack/ceph/cluster.yml - the file affects all minions in the Ceph cluster called ceph.

  • /srv/pillar/ceph/stack/ceph/roles/role.yml - affects all minions that are assigned the specific role in the Ceph cluster.

  • /srv/pillar/ceph/stack/ceph/minions/MINION_ID.yml - affects the individual minion.

Note
Note: Overwriting Directories with Default Values

There is a parallel directory tree that stores the default configuration setup in /srv/pillar/ceph/stack/default. Do not change values here, as they are overwritten.

The typical procedure for changing the collected configuration is the following:

  1. Find the location of the configuration item you need to change. For example, if you need to change a cluster-related setting such as the cluster network, edit the file /srv/pillar/ceph/stack/ceph/cluster.yml (see the example after this procedure).

  2. Save the file.

  3. Verify the changes by running:

    root@master # salt target saltutil.refresh_pillar

    and then

    root@master # salt target pillar.items
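
For example, to change the cluster network shown in the pillar output above, /srv/pillar/ceph/stack/ceph/cluster.yml could contain a line similar to the following (the subnet is illustrative only):

cluster_network: 172.16.22.0/24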

6.2.1 Enabling IPv6 for Ceph Cluster Deployment Edit source

Since IPv4 network addressing is prevalent, you need to enable IPv6 as a customization. DeepSea has no auto discovery of IPv6 addressing.

To configure IPv6, set the public_network and cluster_network variables in the /srv/pillar/ceph/stack/global.yml file to valid IPv6 subnets. For example:

public_network: fd00:10::/64
cluster_network: fd00:11::/64

Then run DeepSea Stage 2 and verify that the network information matches the setting. Stage 3 will generate the ceph.conf with the necessary flags.

Important
Important: No Support for Dual Stack

Ceph does not support dual stack—running Ceph simultaneously on IPv4 and IPv6 is not possible. DeepSea validation will reject a mismatch between public_network and cluster_network or within either variable. The following example will fail the validation.

public_network: "192.168.10.0/24 fd00:10::/64"
Tip
Tip: Avoid using fe80::/10 link-local Addresses

Avoid using fe80::/10 link-local addresses. All network interfaces have an assigned fe80 address and require an interface qualifier for proper routing. Either assign IPv6 addresses allocated to your site or consider using fd00::/8. These are part of ULA and not globally routable.

Part III Installation of Additional Services Edit source

7 Installation of Services to Access your Data Edit source

After you deploy your SUSE Enterprise Storage 6 cluster you may need to install additional software for accessing your data, such as the Object Gateway or the iSCSI Gateway, or you can deploy a clustered file system on top of the Ceph cluster. This chapter mainly focuses on manual installation. If you have a cluster deployed using Salt, refer to Chapter 4, Deploying with DeepSea/Salt for a procedure on installing particular gateways or the CephFS.

8 Ceph Object Gateway Edit source

Ceph Object Gateway is an object storage interface built on top of librgw to provide applications with a RESTful gateway to Ceph clusters. It supports two interfaces:

  • S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.

  • Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

The Object Gateway daemon uses 'Beast' HTTP front-end by default. It uses the Boost.Beast library for HTTP parsing and the Boost.Asio library for asynchronous network I/O operations.

Because Object Gateway provides interfaces compatible with OpenStack Swift and Amazon S3, the Object Gateway has its own user management. Object Gateway can store data in the same cluster that is used to store data from CephFS clients or RADOS Block Device clients. The S3 and Swift APIs share a common name space, so you may write data with one API and retrieve it with the other.

Important
Important: Object Gateway Deployed by DeepSea

Object Gateway is installed as a DeepSea role, therefore you do not need to install it manually.

To install the Object Gateway during the cluster deployment, see Section 4.3, “Cluster Deployment”.

To add a new node with Object Gateway to the cluster, see Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.2 “Adding New Roles to Nodes”.

8.1 Object Gateway Manual Installation Edit source

  1. Install Object Gateway on a node that is not using port 80. The following command installs all required components:

    root # zypper ref && zypper in ceph-radosgw
  2. If the Apache server from the previous Object Gateway instance is running, stop it and disable the relevant service:

    root # systemctl stop apache2.service
    root # systemctl disable apache2.service
  3. Edit /etc/ceph/ceph.conf and add the following lines:

    [client.rgw.gateway_host]
     rgw frontends = "beast port=80"
    Tip
    Tip

    If you want to configure Object Gateway/Beast for use with SSL encryption, modify the line accordingly:

    rgw frontends = beast ssl_port=7480 ssl_certificate=PATH_TO_CERTIFICATE.PEM
  4. Restart the Object Gateway service.

    root # systemctl restart ceph-radosgw@rgw.gateway_host

8.1.1 Object Gateway Configuration Edit source

Several steps are required to configure an Object Gateway.

8.1.1.1 Basic Configuration Edit source

Configuring a Ceph Object Gateway requires a running Ceph Storage Cluster. The Ceph Object Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:

  • A host name for the gateway instance, for example gateway.

  • A storage cluster user name with appropriate permissions and a keyring.

  • Pools to store its data.

  • A data directory for the gateway instance.

  • An instance entry in the Ceph configuration file.

Each instance must have a user name and key to communicate with a Ceph storage cluster. In the following steps, we use a monitor node to create a bootstrap keyring, then create the Object Gateway instance user keyring based on the bootstrap one. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.

  1. Create a keyring for the gateway:

    root # ceph-authtool --create-keyring /etc/ceph/ceph.client.rgw.keyring
    root # chmod +r /etc/ceph/ceph.client.rgw.keyring
  2. Generate a Ceph Object Gateway user name and key for each instance. As an example, we will use the name gateway after client.rgw, resulting in the user name client.rgw.gateway:

    root # ceph-authtool /etc/ceph/ceph.client.rgw.keyring \
      -n client.rgw.gateway --gen-key
  3. Add capabilities to the key:

    root # ceph-authtool -n client.rgw.gateway --cap osd 'allow rwx' \
      --cap mon 'allow rwx' /etc/ceph/ceph.client.rgw.keyring
  4. Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:

    root # ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.gateway \
      -i /etc/ceph/ceph.client.rgw.keyring
  5. Distribute the keyring to the node with the gateway instance:

    root # scp /etc/ceph/ceph.client.rgw.keyring  ceph@HOST_NAME:/home/ceph
    cephadm > ssh HOST_NAME
    root # mv ceph.client.rgw.keyring /etc/ceph/ceph.client.rgw.keyring
Tip
Tip: Use Bootstrap Keyring

An alternative way is to create the Object Gateway bootstrap keyring, and then create the Object Gateway keyring from it:

  1. Create an Object Gateway bootstrap keyring on one of the monitor nodes:

    root # ceph \
     auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
     --connect-timeout=25 \
     --cluster=ceph \
     --name mon. \
     --keyring=/var/lib/ceph/mon/ceph-NODE_HOST/keyring \
     -o /var/lib/ceph/bootstrap-rgw/keyring
  2. Create the /var/lib/ceph/radosgw/ceph-RGW_NAME directory for storing the Object Gateway keyring:

    root # mkdir \
    /var/lib/ceph/radosgw/ceph-RGW_NAME
  3. Create an Object Gateway keyring from the newly created bootstrap keyring:

    root # ceph \
     auth get-or-create client.rgw.RGW_NAME osd 'allow rwx' mon 'allow rw' \
     --connect-timeout=25 \
     --cluster=ceph \
     --name client.bootstrap-rgw \
     --keyring=/var/lib/ceph/bootstrap-rgw/keyring \
     -o /var/lib/ceph/radosgw/ceph-RGW_NAME/keyring
  4. Copy the Object Gateway keyring to the Object Gateway host:

    root # scp \
    /var/lib/ceph/radosgw/ceph-RGW_NAME/keyring \
    RGW_HOST:/var/lib/ceph/radosgw/ceph-RGW_NAME/keyring

8.1.1.2 Create Pools (Optional) Edit source

Ceph Object Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file.
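
For example, to set a default number of placement groups for newly created pools, you can add lines similar to the following to the [global] section of ceph.conf before starting the gateway. The value 128 is illustrative only; choose a value appropriate for your cluster size:

[global]
osd pool default pg num = 128
osd pool default pgp num = 128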

The pool names follow the ZONE_NAME.POOL_NAME syntax. When configuring a gateway with the default region and zone, the default zone name is 'default' as in our example:

.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

To create the pools manually, see Book “Administration Guide”, Chapter 9 “Managing Storage Pools”, Section 9.2.2 “Create a Pool”.

Important
Important: Object Gateway and Erasure-coded Pools

Only the default.rgw.buckets.data pool can be erasure-coded. All other pools need to be replicated, otherwise the gateway is not accessible.

8.1.1.3 Adding Gateway Configuration to Ceph Edit source

Add the Ceph Object Gateway configuration to the Ceph Configuration file. The Ceph Object Gateway configuration requires you to identify the Ceph Object Gateway instance. Then, specify the host name where you installed the Ceph Object Gateway daemon, a keyring (for use with cephx), and optionally a log file. For example:

[client.rgw.INSTANCE_NAME]
host = HOST_NAME
keyring = /etc/ceph/ceph.client.rgw.keyring
Tip
Tip: Object Gateway Log File

To override the default Object Gateway log file, include the following:

log file = /var/log/radosgw/client.rgw.INSTANCE_NAME.log

The [client.rgw.*] portion of the gateway instance identifies this portion of the Ceph configuration file as configuring a Ceph Storage Cluster client where the client type is a Ceph Object Gateway (radosgw). The instance name follows. For example:

[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring
Note
Note

The HOST_NAME must be your machine host name, excluding the domain name.

Then turn off print continue. If you have it set to true, you may encounter problems with PUT operations:

rgw print continue = false

To use a Ceph Object Gateway with subdomain S3 calls (for example http://bucketname.hostname), you must add the Ceph Object Gateway DNS name under the [client.rgw.gateway] section of the Ceph configuration file:

[client.rgw.gateway]
...
rgw dns name = HOST_NAME

You should also consider installing a DNS server such as Dnsmasq on your client machine(s) when using the http://BUCKET_NAME.HOST_NAME syntax. The dnsmasq.conf file should include the following settings:

address=/HOST_NAME/HOST_IP_ADDRESS
listen-address=CLIENT_LOOPBACK_IP

Then, add the CLIENT_LOOPBACK_IP IP address as the first DNS server on the client machine(s).
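
A minimal sketch of the corresponding resolver configuration on the client, assuming dnsmasq listens on the loopback address:

# /etc/resolv.conf
nameserver CLIENT_LOOPBACK_IP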

8.1.1.4 Create Data Directory Edit source

Deployment scripts may not create the default Ceph Object Gateway data directory. Create data directories for each instance of a radosgw daemon if not already done. The host variables in the Ceph configuration file determine which host runs each instance of a radosgw daemon. The typical form specifies the radosgw daemon, the cluster name, and the daemon ID.

root # mkdir -p /var/lib/ceph/radosgw/CLUSTER_ID

Using the example ceph.conf settings above, you would execute the following:

root # mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.gateway

8.1.1.5 Restart Services and Start the Gateway Edit source

To ensure that all components have reloaded their configurations, we recommend restarting your Ceph Storage Cluster service. Then, start up the radosgw service. For more information, see Book “Administration Guide”, Chapter 3 “Introduction” and Book “Administration Guide”, Chapter 15 “Ceph Object Gateway”, Section 15.3 “Operating the Object Gateway Service”.

When the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:

<ListAllMyBucketsResult>
      <Owner>
              <ID>anonymous</ID>
              <DisplayName/>
      </Owner>
      <Buckets/>
</ListAllMyBucketsResult>
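
For example, assuming the gateway listens on port 80 as configured above, you can issue such a request with curl:

cephadm > curl http://HOST_NAME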

9 Installation of iSCSI Gateway Edit source

iSCSI is a storage area network (SAN) protocol that allows clients (called initiators) to send SCSI commands to SCSI storage devices (targets) on remote servers. SUSE Enterprise Storage 6 includes a facility that opens Ceph storage management to heterogeneous clients, such as Microsoft Windows* and VMware* vSphere, through the iSCSI protocol. Multipath iSCSI access enables availability and scalability for these clients, and the standardized iSCSI protocol also provides an additional layer of security isolation between clients and the SUSE Enterprise Storage 6 cluster. The configuration facility is named ceph-iscsi. Using ceph-iscsi, Ceph storage administrators can define thin-provisioned, replicated, highly-available volumes supporting read-only snapshots, read-write clones, and automatic resizing with Ceph RADOS Block Device (RBD). Administrators can then export volumes either via a single ceph-iscsi gateway host, or via multiple gateway hosts supporting multipath failover. Linux, Microsoft Windows, and VMware hosts can connect to volumes using the iSCSI protocol, which makes them available like any other SCSI block device. This means SUSE Enterprise Storage 6 customers can effectively run a complete block-storage infrastructure subsystem on Ceph that provides all the features and benefits of a conventional SAN, enabling future growth.

This chapter introduces detailed information to set up a Ceph cluster infrastructure together with an iSCSI gateway so that the client hosts can use remotely stored data as local storage devices using the iSCSI protocol.

9.1 iSCSI Block Storage Edit source

iSCSI is an implementation of the Small Computer System Interface (SCSI) command set using the Internet Protocol (IP), specified in RFC 3720. iSCSI is implemented as a service where a client (the initiator) talks to a server (the target) via a session on TCP port 3260. An iSCSI target's IP address and port are called an iSCSI portal, where a target can be exposed through one or more portals. The combination of a target and one or more portals is called the target portal group (TPG).

The underlying data link layer protocol for iSCSI is commonly Ethernet. More specifically, modern iSCSI infrastructures use 10 Gigabit Ethernet or faster networks for optimal throughput. 10 Gigabit Ethernet connectivity between the iSCSI gateway and the back-end Ceph cluster is strongly recommended.

9.1.1 The Linux Kernel iSCSI Target Edit source

The Linux kernel iSCSI target was originally named LIO for linux-iscsi.org, the project's original domain and Web site. For some time, no fewer than four competing iSCSI target implementations were available for the Linux platform, but LIO ultimately prevailed as the single iSCSI reference target. The mainline kernel code for LIO uses the simple, but somewhat ambiguous name "target", distinguishing between "target core" and a variety of front-end and back-end target modules.

The most commonly used front-end module is arguably iSCSI. However, LIO also supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE) and several other front-end protocols. At this time, only the iSCSI protocol is supported by SUSE Enterprise Storage.

The most frequently used target back-end module is one that is capable of simply re-exporting any available block device on the target host. This module is named iblock. However, LIO also has an RBD-specific back-end module supporting parallelized multipath I/O access to RBD images.

9.1.2 iSCSI Initiators Edit source

This section introduces brief information on iSCSI initiators used on Linux, Microsoft Windows, and VMware platforms.

9.1.2.1 Linux Edit source

The standard initiator for the Linux platform is open-iscsi. open-iscsi launches a daemon, iscsid, which the user can then use to discover iSCSI targets on any given portal, log in to targets, and map iSCSI volumes. iscsid communicates with the SCSI mid layer to create in-kernel block devices that the kernel can then treat like any other SCSI block device on the system. The open-iscsi initiator can be deployed in conjunction with the Device Mapper Multipath (dm-multipath) facility to provide a highly available iSCSI block device.
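
A brief sketch of this workflow with iscsiadm, assuming a gateway portal reachable at PORTAL_IP and the example target created later in this chapter:

root # iscsiadm -m discovery -t sendtargets -p PORTAL_IP
root # iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol -p PORTAL_IP --login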

9.1.2.2 Microsoft Windows and Hyper-V Edit source

The default iSCSI initiator for the Microsoft Windows operating system is the Microsoft iSCSI initiator. The iSCSI service can be configured via a graphical user interface (GUI), and supports multipath I/O for high availability.

9.1.2.3 VMware Edit source

The default iSCSI initiator for VMware vSphere and ESX is the VMware ESX software iSCSI initiator, vmkiscsi. When enabled, it can be configured either from the vSphere client, or using the vmkiscsi-tool command. You can then format storage volumes connected through the vSphere iSCSI storage adapter with VMFS, and use them like any other VM storage device. The VMware initiator also supports multipath I/O for high availability.

9.2 General Information about ceph-iscsi Edit source

ceph-iscsi combines the benefits of RADOS Block Devices with the ubiquitous versatility of iSCSI. By employing ceph-iscsi on an iSCSI target host (known as the iSCSI Gateway), any application that needs to make use of block storage can benefit from Ceph, even if it does not speak any Ceph client protocol. Instead, users can use iSCSI or any other target front-end protocol to connect to an LIO target, which translates all target I/O to RBD storage operations.

Ceph Cluster with a Single iSCSI Gateway
Figure 9.1: Ceph Cluster with a Single iSCSI Gateway

ceph-iscsi is inherently highly-available and supports multipath operations. Thus, downstream initiator hosts can use multiple iSCSI gateways for both high availability and scalability. When communicating with an iSCSI configuration with more than one gateway, initiators may load-balance iSCSI requests across multiple gateways. In the event of a gateway failing, being temporarily unreachable, or being disabled for maintenance, I/O will transparently continue via another gateway.

Ceph Cluster with Multiple iSCSI Gateways
Figure 9.2: Ceph Cluster with Multiple iSCSI Gateways

9.3 Deployment Considerations Edit source

A minimum configuration of SUSE Enterprise Storage 6 with ceph-iscsi consists of the following components:

  • A Ceph storage cluster. The Ceph cluster consists of a minimum of four physical servers hosting at least eight object storage daemons (OSDs) each. In such a configuration, three OSD nodes also double as a monitor (MON) host.

  • An iSCSI target server running the LIO iSCSI target, configured via ceph-iscsi.

  • An iSCSI initiator host, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

A recommended production configuration of SUSE Enterprise Storage 6 with ceph-iscsi consists of:

  • A Ceph storage cluster. A production Ceph cluster consists of any number of (typically more than 10) OSD nodes, each typically running 10-12 object storage daemons (OSDs), with no fewer than three dedicated MON hosts.

  • Several iSCSI target servers running the LIO iSCSI target, configured via ceph-iscsi. For iSCSI fail-over and load-balancing, these servers must run a kernel supporting the target_core_rbd module. Update packages are available from the SUSE Linux Enterprise Server maintenance channel.

  • Any number of iSCSI initiator hosts, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

9.4 Installation and Configuration Edit source

This section describes steps to install and configure an iSCSI Gateway on top of SUSE Enterprise Storage.

9.4.1 Deploy the iSCSI Gateway to a Ceph Cluster Edit source

You can deploy the iSCSI Gateway either during Ceph cluster deployment process, or add it to an existing cluster using DeepSea.

To include the iSCSI Gateway during the cluster deployment process, refer to Section 4.5.1.2, “Role Assignment”.

To add the iSCSI Gateway to an existing cluster, refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.2 “Adding New Roles to Nodes”.

9.4.2 Create RBD Images Edit source

RBD images are created in the Ceph store and subsequently exported to iSCSI. We recommend that you use a dedicated RADOS pool for this purpose. You can create a volume from any host that is able to connect to your storage cluster using the Ceph rbd command line utility. This requires the client to have at least a minimal ceph.conf configuration file, and appropriate CephX authentication credentials.

To create a new volume for subsequent export via iSCSI, use the rbd create command, specifying the volume size in megabytes. For example, to create a 100 GB volume named 'testvol' in the pool named 'iscsi-images', run:

cephadm > rbd --pool iscsi-images create --size=102400 'testvol'

9.4.3 Export RBD Images via iSCSI Edit source

To export RBD images via iSCSI, you can use either Ceph Dashboard Web interface or the ceph-iscsi gwcli utility. In this section we will focus on gwcli only, demonstrating how to create an iSCSI target that exports an RBD image using the command line.

Note
Note

Only the following RBD image features are supported: layering, striping (v2), exclusive-lock, fast-diff and data-pool. RBD images with any other feature enabled cannot be exported.

As root, start the iSCSI gateway command line interface:

root #  gwcli

Go to iscsi-targets and create a target with the name iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol:

gwcli >  /> cd /iscsi-targets
gwcli >  /iscsi-targets> create iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol

Create the iSCSI gateways by specifying the gateway name and IP address:

gwcli >  /iscsi-targets> cd iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/gateways
gwcli >  /iscsi-target...tvol/gateways> create iscsi1 192.168.124.104
gwcli >  /iscsi-target...tvol/gateways> create iscsi2 192.168.124.105
Tip
Tip

Use the help command to show the list of available commands in the current configuration node.

Add the RBD image with the name 'testvol' in the pool 'iscsi-images':

gwcli >  /iscsi-target...tvol/gateways> cd /disks
gwcli >  /disks> attach iscsi-images/testvol

Map the RBD image to the target:

gwcli >  /disks> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/disks
gwcli >  /iscsi-target...testvol/disks> add iscsi-images/testvol
Note
Note

You can use lower level tools, such as targetcli, to query the local configuration, but not to modify it.

Tip
Tip

You can use the ls command to review the configuration. Some configuration nodes also support the info command that can be used to display more detailed information.

Note that ACL authentication is enabled by default, so this target is not accessible yet. See Section 9.4.4, “Authentication and Access Control” for more information about authentication and access control.

9.4.4 Authentication and Access Control Edit source

iSCSI authentication is flexible and covers many authentication possibilities.

9.4.4.1 No Authentication Edit source

'No authentication' means that any initiator will be able to access any LUNs on the corresponding target. You can enable 'No authentication' by disabling the ACL authentication:

gwcli >  /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/hosts
gwcli >  /iscsi-target...testvol/hosts> auth disable_acl

9.4.4.2 ACL Authentication Edit source

When using initiator-name-based ACL authentication, only the defined initiators are allowed to connect. You can define an initiator as follows:

gwcli >  /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/hosts
gwcli >  /iscsi-target...testvol/hosts> create iqn.1996-04.de.suse:01:e6ca28cc9f20

Defined initiators will be able to connect, but will only have access to the RBD images that were explicitly added to the initiator:

gwcli >  /iscsi-target...:e6ca28cc9f20> disk add rbd/testvol

9.4.4.3 CHAP Authentication Edit source

In addition to the ACL, you can enable the CHAP authentication by specifying a user name and password for each initiator:

gwcli >  /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol/hosts/iqn.1996-04.de.suse:01:e6ca28cc9f20
gwcli >  /iscsi-target...:e6ca28cc9f20> auth username=common12 password=pass12345678
Note
Note

User names must have a length of 8 to 64 characters and can only contain letters, '.', '@', '-', '_' or ':'.

Passwords must have a length of 12 to 16 characters and can only contain letters, '@', '-', '_' or '/'.

Optionally, you can also enable the CHAP mutual authentication by specifying the mutual_username and mutual_password parameters in the auth command.

9.4.4.4 Discovery and Mutual Authentication Edit source

Discovery authentication is independent of the previous authentication methods. It requires credentials for browsing, is optional, and can be configured as follows:

gwcli >  /> cd /iscsi-targets
gwcli >  /iscsi-targets> discovery_auth username=du123456 password=dp1234567890
Note
Note

User names must have a length of 8 to 64 characters and can only contain letters, '.', '@', '-', '_' or ':'.

Passwords must have a length of 12 to 16 characters and can only contain letters, '@', '-', '_' or '/'.

Optionally, you can also specify the mutual_username and mutual_password parameters in the discovery_auth command.

Discovery authentication can be disabled by using the following command:

gwcli >  /iscsi-targets> discovery_auth nochap

9.4.5 Advanced Settings Edit source

ceph-iscsi can be configured with advanced parameters which are subsequently passed on to the LIO I/O target. The parameters are divided up into 'target' and 'disk' parameters.

Warning
Warning

Unless otherwise noted, changing these parameters from the default setting is not recommended.

9.4.5.1 Target Settings Edit source

You can view the value of these settings by using the info command:

gwcli >  /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol
gwcli >  /iscsi-target...i.x86:testvol> info

And change a setting using the reconfigure command:

gwcli >  /iscsi-target...i.x86:testvol> reconfigure login_timeout 20

The available 'target' settings are:

default_cmdsn_depth

Default CmdSN (Command Sequence Number) depth. Limits the number of requests that an iSCSI initiator can have outstanding at any moment.

default_erl

Default error recovery level.

login_timeout

Login timeout value in seconds.

netif_timeout

NIC failure timeout in seconds.

prod_mode_write_protect

If set to 1, prevents writes to LUNs.

9.4.5.2 Disk Settings Edit source

You can view the value of these settings by using the info command:

gwcli >  /> cd /disks/rbd/testvol
gwcli >  /disks/rbd/testvol> info

And change a setting using the reconfigure command:

gwcli >  /disks/rbd/testvol> reconfigure rbd/testvol emulate_pr 0

The available 'disk' settings are:

block_size

Block size of the underlying device.

emulate_3pc

If set to 1, enables Third Party Copy.

emulate_caw

If set to 1, enables Compare and Write.

emulate_dpo

If set to 1, turns on Disable Page Out.

emulate_fua_read

If set to 1, enables Force Unit Access read.

emulate_fua_write

If set to 1, enables Force Unit Access write.

emulate_model_alias

If set to 1, uses the back-end device name for the model alias.

emulate_pr

If set to 0, support for SCSI Reservations, including Persistent Group Reservations, is disabled. While disabled, the SES iSCSI Gateway can ignore reservation state, resulting in improved request latency.

Tip
Tip

Setting emulate_pr to 0 is recommended if iSCSI initiators do not require SCSI Reservation support.

emulate_rest_reord

If set to 0, the Queue Algorithm Modifier has Restricted Reordering.

emulate_tas

If set to 1, enables Task Aborted Status.

emulate_tpu

If set to 1, enables Thin Provisioning Unmap.

emulate_tpws

If set to 1, enables Thin Provisioning Write Same.

emulate_ua_intlck_ctrl

If set to 1, enables Unit Attention Interlock.

emulate_write_cache

If set to 1, turns on Write Cache Enable.

enforce_pr_isids

If set to 1, enforces persistent reservation ISIDs.

is_nonrot

If set to 1, the backstore is a non-rotational device.

max_unmap_block_desc_count

Maximum number of block descriptors for UNMAP.

max_unmap_lba_count

Maximum number of LBAs for UNMAP.

max_write_same_len

Maximum length for WRITE_SAME.

optimal_sectors

Optimal request size in sectors.

pi_prot_type

DIF protection type.

queue_depth

Queue depth.

unmap_granularity

UNMAP granularity.

unmap_granularity_alignment

UNMAP granularity alignment.

force_pr_aptpl

When enabled, LIO will always write out the persistent reservation state to persistent storage, regardless of whether or not the client has requested it via aptpl=1. This has no effect with the kernel RBD back-end for LIO—it always persists PR state. Ideally, the target_core_rbd option should force it to '1' and throw an error if someone tries to disable it via configfs.

unmap_zeroes_data

Affects whether LIO will advertise LBPRZ to SCSI initiators, indicating that zeros will be read back from a region following UNMAP or WRITE SAME with an unmap bit.

9.5 Exporting RADOS Block Device Images using tcmu-runner Edit source

ceph-iscsi supports both the rbd (kernel-based) and the user:rbd (tcmu-runner) backstores, making all management transparent and independent of the backstore.

Warning
Warning: Technology Preview

tcmu-runner based iSCSI Gateway deployments are currently a technology preview.

Unlike kernel-based iSCSI Gateway deployments, tcmu-runner based iSCSI Gateways do not offer support for multipath I/O or SCSI Persistent Reservations.

To export a RADOS Block Device image using tcmu-runner, specify the user:rbd backstore when attaching the disk:

gwcli >  /disks> attach rbd/testvol backstore=user:rbd
Note
Note

When using tcmu-runner, the exported RBD image must have the exclusive-lock feature enabled.

10 Installation of CephFS Edit source

The Ceph file system (CephFS) is a POSIX-compliant file system that uses a Ceph storage cluster to store its data. CephFS uses the same cluster system as Ceph block devices, Ceph object storage with its S3 and Swift APIs, or native bindings (librados).

To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.

10.1 Supported CephFS Scenarios and Guidance Edit source

With SUSE Enterprise Storage 6, SUSE introduces official support for many scenarios in which the scale-out and distributed component CephFS is used. This section describes hard limits and provides guidance for the suggested use cases.

A supported CephFS deployment must meet these requirements:

  • A minimum of one Metadata Server. SUSE recommends deploying several nodes with the MDS role. Only one will be active and the rest will be passive. Remember to mention all the MON nodes in the mount command when mounting the CephFS from a client (see the example after this list).

  • Clients are SUSE Linux Enterprise Server 12 SP3 or newer, or SUSE Linux Enterprise Server 15 or newer, using the cephfs kernel module driver. The FUSE module is not supported.

  • CephFS quotas are supported in SUSE Enterprise Storage 6 and can be set on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy. For more information, see Book “Administration Guide”, Chapter 17 “Clustered File System”, Section 17.6 “Setting CephFS Quotas”.

  • CephFS supports file layout changes as documented in Section 10.3.4, “File Layouts”. However, while the file system is mounted by any client, new data pools may not be added to an existing CephFS file system (ceph mds add_data_pool). They may only be added while the file system is unmounted.

  • We recommend deploying several nodes with the MDS role. By default, additional MDS daemons start as standby daemons, acting as backups for the active MDS. Multiple active MDS daemons are also supported (refer to Section 10.3.2, “MDS Cluster Size”).
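
The following is a hedged example of such a mount command listing all three MON nodes; the host names, mount point, and secret file path are placeholders:

root # mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret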

10.2 Ceph Metadata Server Edit source

Ceph metadata server (MDS) stores metadata for the CephFS. Ceph block devices and Ceph object storage do not use MDS. MDSs make it possible for POSIX file system users to execute basic commands—such as ls or find—without placing an enormous burden on the Ceph storage cluster.

10.2.1 Adding a Metadata Server Edit source

You can deploy MDS either during the initial cluster deployment process as described in Section 4.3, “Cluster Deployment”, or add it to an already deployed cluster as described in Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.1 “Adding New Cluster Nodes”.

After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall setting of the server where the MDS is deployed: start YaST, navigate to Security and Users › Firewall › Allowed Services and in the Service to Allow drop-down menu select Ceph OSD/MDS. If the Ceph MDS node is not allowed full traffic, mounting of a file system fails, even though other operations may work properly.
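
If you prefer the command line over YaST, the following firewalld commands are an assumption-based alternative that opens the standard Ceph daemon ports; verify that the 'ceph' service definition exists on your system:

root # firewall-cmd --permanent --add-service=ceph
root # firewall-cmd --reload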

10.2.2 Configuring a Metadata Server Edit source

You can fine-tune the MDS behavior by inserting relevant options in the ceph.conf configuration file.
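
For example, to raise the MDS cache memory limit described below to 4 GB, you could add the following to ceph.conf on the MDS nodes; the value is in bytes and illustrative only:

[mds]
mds cache memory limit = 4294967296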

Metadata Server Settings
mon force standby active

If set to 'true' (default), monitors force standby-replay to be active. Set under [mon] or [global] sections.

mds cache memory limit

The soft memory limit (in bytes) that the MDS will enforce for its cache. Administrators should use this instead of the old mds cache size setting. Defaults to 1GB.

mds cache reservation

The cache reservation (memory or inodes) for the MDS cache to maintain. When the MDS begins touching its reservation, it will recall client state until its cache size shrinks to restore the reservation. Defaults to 0.05.

mds cache size

The number of inodes to cache. A value of 0 (default) indicates an unlimited number. It is recommended to use mds cache memory limit to limit the amount of memory the MDS cache uses.

mds cache mid

The insertion point for new items in the cache LRU (from the top). Default is 0.7

mds dir commit ratio

The fraction of directory that is dirty before Ceph commits using a full update instead of partial update. Default is 0.5

mds dir max commit size

The maximum size of a directory update before Ceph breaks it into smaller transactions. Default is 90 MB.

mds decay halflife

The half-life of MDS cache temperature. Default is 5.

mds beacon interval

The frequency in seconds of beacon messages sent to the monitor. Default is 4.

mds beacon grace

The interval without beacons before Ceph declares an MDS laggy and possibly replaces it. Default is 15.

mds blacklist interval

The blacklist duration for failed MDSs in the OSD map. This setting controls how long failed MDS daemons will stay in the OSD map blacklist. It has no effect on how long something is blacklisted when the administrator blacklists it manually. For example, the ceph osd blacklist add command will still use the default blacklist time. Default is 24 * 60.

mds reconnect timeout

The interval in seconds to wait for clients to reconnect during MDS restart. Default is 45.

mds tick interval

How frequently the MDS performs internal periodic tasks. Default is 5.

mds dirstat min interval

The minimum interval in seconds to try to avoid propagating recursive stats up the tree. Default is 1.

mds scatter nudge interval

How quickly dirstat changes propagate up. Default is 5.

mds client prealloc inos

The number of inode numbers to preallocate per client session. Default is 1000.

mds early reply

Determines whether the MDS should allow clients to see request results before they commit to the journal. Default is 'true'.

mds use tmap

Use trivial map for directory updates. Default is 'true'.

mds default dir hash

The function to use for hashing files across directory fragments. Default is 2 (that is 'rjenkins').

mds log skip corrupt events

Determines whether the MDS should try to skip corrupt journal events during journal replay. Default is 'false'.

mds log max events

The maximum events in the journal before we initiate trimming. Set to -1 (default) to disable limits.

mds log max segments

The maximum number of segments (objects) in the journal before we initiate trimming. Set to -1 to disable limits. Default is 30.

mds log max expiring

The maximum number of segments to expire in parallel. Default is 20.

mds log eopen size

The maximum number of inodes in an EOpen event. Default is 100.

mds bal sample interval

Determines how frequently to sample directory temperature for fragmentation decisions. Default is 3.

mds bal replicate threshold

The maximum temperature before Ceph attempts to replicate metadata to other nodes. Default is 8000.

mds bal unreplicate threshold

The minimum temperature before Ceph stops replicating metadata to other nodes. Default is 0.

mds bal split size

The maximum directory size before the MDS will split a directory fragment into smaller bits. Default is 10000.

mds bal split rd

The maximum directory read temperature before Ceph splits a directory fragment. Default is 25000.

mds bal split wr

The maximum directory write temperature before Ceph splits a directory fragment. Default is 10000.

mds bal split bits

The number of bits by which to split a directory fragment. Default is 3.

mds bal merge size

The minimum directory size before Ceph tries to merge adjacent directory fragments. Default is 50.

mds bal interval

The frequency in seconds of workload exchanges between MDSs. Default is 10.

mds bal fragment interval

The delay in seconds between when a fragment becomes eligible to split or merge and when the fragmentation change is executed. Default is 5.

mds bal fragment fast factor

The ratio by which fragments may exceed the split size before a split is executed immediately, skipping the fragment interval. Default is 1.5.

mds bal fragment size max

The maximum size of a fragment before any new entries are rejected with ENOSPC. Default is 100000.

mds bal idle threshold

The minimum temperature before Ceph migrates a subtree back to its parent. Default is 0.

mds bal mode

The method for calculating MDS load:

  • 0 = Hybrid.

  • 1 = Request rate and latency.

  • 2 = CPU load.

Default is 0.

mds bal min rebalance

The minimum subtree temperature before Ceph migrates. Default is 0.1.

mds bal min start

The minimum subtree temperature before Ceph searches a subtree. Default is 0.2.

mds bal need min

The minimum fraction of target subtree size to accept. Default is 0.8.

mds bal need max

The maximum fraction of target subtree size to accept. Default is 1.2.

mds bal midchunk

Ceph will migrate any subtree that is larger than this fraction of the target subtree size. Default is 0.3.

mds bal minchunk

Ceph will ignore any subtree that is smaller than this fraction of the target subtree size. Default is 0.001.

mds bal target removal min

The minimum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 5.

mds bal target removal max

The maximum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 10.

mds replay interval

The journal poll interval when in standby-replay mode ('hot standby'). Default is 1.

mds shutdown check

The interval for polling the cache during MDS shutdown. Default is 0.

mds thrash fragments

Ceph will randomly fragment or merge directories. Default is 0.

mds dump cache on map

Ceph will dump the MDS cache contents to a file on each MDS map. Default is 'false'.

mds dump cache after rejoin

Ceph will dump MDS cache contents to a file after rejoining the cache during recovery. Default is 'false'.

mds standby for name

An MDS daemon will act as a standby for another MDS daemon with the name specified in this setting.

mds standby for rank

An MDS daemon will act as a standby for an MDS daemon of this rank. Default is -1.

mds standby replay

Determines whether a Ceph MDS daemon should poll and replay the log of an active MDS ('hot standby'). Default is 'false'.

mds min caps per client

Set the minimum number of capabilities a client may hold. Default is 100.

mds max ratio caps per client

Set the maximum ratio of current caps that may be recalled during MDS cache pressure. Default is 0.8.

Metadata Server Journaler Settings
journaler write head interval

How frequently to update the journal head object. Default is 15.

journaler prefetch periods

How many stripe periods to read-ahead on journal replay. Default is 10.

journaler prezero periods

How many stripe periods to zero ahead of the write position. Default is 10.

journaler batch interval

Maximum additional latency in seconds we incur artificially. Default is 0.001.

journaler batch max

Maximum bytes we will delay flushing. Default is 0.
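If you need to change any of these options, they can be set in the [mds] section of /etc/ceph/ceph.conf on the Metadata Server nodes. The following is a minimal sketch; the 4 GB cache limit is an example value only, not a tuned recommendation:

[mds]
# example value only: limit the MDS cache to 4 GB (the value is in bytes)
mds cache memory limit = 4294967296

Restart the MDS daemons afterwards so that settings changed in the configuration file take effect.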

10.3 CephFS

When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you can create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.

10.3.1 Creating CephFS

A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:

  • Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.

  • Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.

When you assign the role-mds in policy.cfg, the required pools are created automatically. To tune their performance manually, you can create the pools cephfs_data and cephfs_metadata yourself before setting up the Metadata Server; DeepSea will not re-create these pools if they already exist.
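For illustration, assigning the Metadata Server role in policy.cfg follows the same pattern as the other role lines used throughout this guide; the node name is a placeholder:

role-mds/cluster/NODENAME*.sls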

For more information on managing pools, see Book “Administration Guide”, Chapter 9 “Managing Storage Pools”.

To create the two required pools—for example, 'cephfs_data' and 'cephfs_metadata'—with default settings for use with CephFS, run the following commands:

cephadm > ceph osd pool create cephfs_data pg_num
cephadm > ceph osd pool create cephfs_metadata pg_num

It is possible to use EC (erasure coded) pools instead of replicated pools. We recommend using EC pools only for workloads with low performance requirements and infrequent random access, for example cold storage, backups, or archiving. CephFS on EC pools requires BlueStore to be enabled, and the pool must have the allow_ec_overwrites option set. You can set this option by running ceph osd pool set ec_pool allow_ec_overwrites true.

Erasure coding adds significant overhead to file system operations, especially small updates. This overhead is inherent to using erasure coding as a fault tolerance mechanism. This penalty is the trade-off for significantly reduced storage space overhead.
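As a sketch, an erasure coded data pool could be created and prepared for CephFS as follows. The pool name, placement group count, and use of the default erasure code profile are example choices only:

cephadm > ceph osd pool create cephfs_data_ec 64 64 erasure
cephadm > ceph osd pool set cephfs_data_ec allow_ec_overwrites true

Such a pool is typically added as an additional data pool (see Section 10.3.4.6, “Adding a Data Pool to the Metadata Server”), while the metadata pool always remains replicated.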

When the pools are created, you may enable the file system with the ceph fs new command:

cephadm > ceph fs new fs_name metadata_pool_name data_pool_name

For example:

cephadm > ceph fs new cephfs cephfs_metadata cephfs_data

You can check that the file system was created by listing all available CephFSs:

cephadm > ceph fs ls
 name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

When the file system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:

cephadm > ceph mds stat
e5: 1/1/1 up
Tip: More Topics

You can find more information on specific tasks, for example mounting, unmounting, and advanced CephFS setup, in Book “Administration Guide”, Chapter 17 “Clustered File System”.
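As a quick illustration only, a kernel client mount may look similar to the following. The monitor host name and the secret file path are placeholders; refer to the chapter above for the complete mounting options:

root # mount -t ceph MON_HOST:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret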

10.3.2 MDS Cluster Size

A CephFS instance can be served by multiple active MDS daemons. All active MDS daemons that are assigned to a CephFS instance will distribute the file system's directory tree between themselves, and thus spread the load of concurrent clients. In order to add an active MDS daemon to a CephFS instance, a spare standby is needed. Either start an additional daemon or use an existing standby instance.

The following command will display the current number of active and passive MDS daemons.

cephadm > ceph mds stat

The following command sets the number of active MDS daemons to two for a file system instance.

cephadm > ceph fs set fs_name max_mds 2

In order to shrink the MDS cluster prior to an update, two steps are necessary. First, set max_mds so that only one instance remains:

cephadm > ceph fs set fs_name max_mds 1

and after that explicitly deactivate the other active MDS daemons:

cephadm > ceph mds deactivate fs_name:rank

where rank is the number of an active MDS daemon of a file system instance, ranging from 0 to max_mds-1.

We recommend leaving at least one MDS as a standby daemon.

10.3.3 MDS Cluster and Updates

During Ceph updates, the feature flags on a file system instance may change (usually by adding new features). Incompatible daemons (such as older versions) cannot function with an incompatible feature set and will refuse to start. This means that updating and restarting one daemon can cause all other not yet updated daemons to stop and refuse to start. For this reason, we recommend shrinking the active MDS cluster to size one and stopping all standby daemons before updating Ceph. The manual steps for this update procedure are as follows:

  1. Update the Ceph related packages using zypper.

  2. Shrink the active MDS cluster as described above to 1 instance and stop all standby MDS daemons using their systemd units on all other nodes:

    root # systemctl stop ceph-mds\*.service ceph-mds.target
  3. Only then restart the single remaining MDS daemon, causing it to restart using the updated binary.

    root # systemctl restart ceph-mds\*.service ceph-mds.target
  4. Restart all other MDS daemons and re-set the desired max_mds setting.

    root # systemctl start ceph-mds.target

If you use DeepSea, it will follow this procedure if the ceph package is updated during Stages 0 and 4. It is possible to perform this procedure while clients have the CephFS instance mounted and I/O is ongoing. Note, however, that there will be a very brief I/O pause while the active MDS restarts. Clients will recover automatically.

It is good practice to reduce the I/O load as much as possible before updating an MDS cluster. An idle MDS cluster will go through this update procedure more quickly. Conversely, on a heavily loaded cluster with multiple MDS daemons, it is essential to reduce the load in advance to prevent a single MDS daemon from being overwhelmed by ongoing I/O.

10.3.4 File Layouts

The layout of a file controls how its contents are mapped to Ceph RADOS objects. You can read and write a file’s layout using virtual extended attributes (xattrs for short).

The name of the layout xattrs depends on whether a file is a regular file or a directory. Regular files’ layout xattrs are called ceph.file.layout, while directories’ layout xattrs are called ceph.dir.layout. Where examples refer to ceph.file.layout, substitute the .dir. part as appropriate when dealing with directories.

10.3.4.1 Layout Fields

The following attribute fields are recognized:

pool

ID or name of a RADOS pool in which a file’s data objects will be stored.

pool_namespace

RADOS namespace within a data pool to which the objects will be written. It is empty by default, meaning the default namespace.

stripe_unit

The size in bytes of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete—it represents the data at the end of the file as well as the unused 'space' beyond it up to the end of the fixed stripe unit size.

stripe_count

The number of consecutive stripe units that constitute a RAID 0 'stripe' of file data.

object_size

The size in bytes of RADOS objects into which the file data is chunked.

Tip: Object Sizes

RADOS enforces a configurable limit on object sizes. If you increase CephFS object sizes beyond that limit, writes may not succeed. The OSD setting is osd_max_object_size, which is 128 MB by default. Very large RADOS objects may prevent smooth operation of the cluster, so increasing the object size limit past the default is not recommended.

10.3.4.2 Reading Layout with getfattr

Use the getfattr command to read the layout information of an example file named file as a single string:

root # touch file
root # getfattr -n ceph.file.layout file
# file: file
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=419430

Read individual layout fields:

root # getfattr -n ceph.file.layout.pool file
# file: file
ceph.file.layout.pool="cephfs_data"
root # getfattr -n ceph.file.layout.stripe_unit file
# file: file
ceph.file.layout.stripe_unit="4194304"
Tip: Pool ID or Name

When reading layouts, the pool will usually be indicated by name. However, in rare cases when pools have only just been created, the ID may be output instead.

Directories do not have an explicit layout until one is set. Attempts to read the layout will fail if it has never been modified; this indicates that the layout of the closest ancestor directory with an explicit layout will be used.

root # mkdir dir
root # getfattr -n ceph.dir.layout dir
dir: ceph.dir.layout: No such attribute
root # setfattr -n ceph.dir.layout.stripe_count -v 2 dir
root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"

10.3.4.3 Writing Layouts with setfattr

Use the setfattr command to modify the layout fields of an example file named file:

root # ceph osd lspools
0 rbd
1 cephfs_data
2 cephfs_metadata
root # setfattr -n ceph.file.layout.stripe_unit -v 1048576 file
root # setfattr -n ceph.file.layout.stripe_count -v 8 file
# Setting pool by ID:
root # setfattr -n ceph.file.layout.pool -v 1 file
# Setting pool by name:
root # setfattr -n ceph.file.layout.pool -v cephfs_data file
Note: Empty File

When the layout fields of a file are modified using setfattr, the file must be empty, otherwise an error will occur.

10.3.4.4 Clearing Layouts

If you want to remove an explicit layout from an example directory mydir and revert to inheriting the layout of its ancestor, run the following:

root # setfattr -x ceph.dir.layout mydir

Similarly, if you have set the 'pool_namespace' attribute and wish to modify the layout to use the default namespace instead, run:

# Create a directory and set a namespace on it
root # mkdir mydir
root # setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a pool_namespace=foons"

# Clear the namespace from the directory's layout
root # setfattr -x ceph.dir.layout.pool_namespace mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a"

10.3.4.5 Inheritance of Layouts

Files inherit the layout of their parent directory at creation time. However, subsequent changes to the parent directory’s layout do not affect children:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# file1 inherits its parent's layout
root # touch dir/file1
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# update the layout of the directory before creating a second file
root # setfattr -n ceph.dir.layout.stripe_count -v 4 dir
root # touch dir/file2

# file1's layout is unchanged
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# ...while file2 has the parent directory's new layout
root # getfattr -n ceph.file.layout dir/file2
# file: dir/file2
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"

Files created as descendants of the directory also inherit its layout if the intermediate directories do not have layouts set:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"
root # mkdir dir/childdir
root # getfattr -n ceph.dir.layout dir/childdir
dir/childdir: ceph.dir.layout: No such attribute
root # touch dir/childdir/grandchild
root # getfattr -n ceph.file.layout dir/childdir/grandchild
# file: dir/childdir/grandchild
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"

10.3.4.6 Adding a Data Pool to the Metadata Server

Before you can use a pool with CephFS, you need to add it to the Metadata Server:

root # ceph fs add_data_pool cephfs cephfs_data_ssd
root # ceph fs ls  # Pool should now show up
.... data pools: [cephfs_data cephfs_data_ssd ]
Tip: cephx Keys

Make sure that your cephx keys allow the client to access this new pool.
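For example, the capabilities of an existing client could be updated to cover the new pool roughly as follows. The client name and pool names are placeholders, and note that ceph auth caps replaces the complete capability set, so all previously granted capabilities must be repeated:

root # ceph auth caps client.cephfs_user \
  mon 'allow r' mds 'allow rw' \
  osd 'allow rw pool=cephfs_data, allow rw pool=cephfs_data_ssd'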

You can then update the layout on a directory in CephFS to use the pool you added:

root # mkdir /mnt/cephfs/myssddir
root # setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/myssddir

All new files created within that directory will now inherit its layout and place their data in your newly added pool. You may notice that the number of objects in your primary data pool continues to increase, even when files are created in the newly added pool. This is normal: the file data is stored in the pool specified by the layout, but a small amount of metadata is kept in the primary data pool for all files.
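To check where the data ends up, you can, for example, inspect per-pool usage or list a few objects in the newly added pool (the pool name matches the example above):

cephadm > rados df
cephadm > rados -p cephfs_data_ssd ls | head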

11 Installation of NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or CephFS. In SUSE Enterprise Storage 6, NFS versions 3 and 4 are supported. NFS Ganesha runs in user space instead of kernel space and interacts directly with the Object Gateway or CephFS.

Warning: Cross Protocol Access

Native CephFS and NFS clients are not restricted by file locks obtained via Samba, and vice versa. Applications that rely on cross-protocol file locking may experience data corruption if CephFS-backed Samba share paths are accessed via other means.

11.1 Preparation

11.1.1 General Information

To successfully deploy NFS Ganesha, you need to add a role-ganesha to your /srv/pillar/ceph/proposals/policy.cfg. For details, see Section 4.5.1, “The policy.cfg File”. NFS Ganesha also needs either a role-rgw or a role-mds present in the policy.cfg.

Although it is possible to install and run the NFS Ganesha server on an already existing Ceph node, we recommend running it on a dedicated host with access to the Ceph cluster. The client hosts are typically not part of the cluster, but they need to have network access to the NFS Ganesha server.

To enable the NFS Ganesha server at any point after the initial installation, add the role-ganesha to the policy.cfg and re-run at least DeepSea stages 2 and 4. For details, see Section 4.3, “Cluster Deployment”.

NFS Ganesha is configured via the file /etc/ganesha/ganesha.conf that exists on the NFS Ganesha node. However, this file is overwritten each time DeepSea stage 4 is executed. Therefore, we recommend editing the template used by Salt, which is the file /srv/salt/ceph/ganesha/files/ganesha.conf.j2 on the Salt master. For details about the configuration file, see Book “Administration Guide”, Chapter 19 “NFS Ganesha: Export Ceph Data via NFS”, Section 19.2 “Configuration”.

11.1.2 Summary of Requirements

The following requirements need to be met before DeepSea stages 2 and 4 can be executed to install NFS Ganesha:

  • At least one node needs to be assigned the role-ganesha.

  • You can define only one role-ganesha per minion.

  • NFS Ganesha needs either an Object Gateway or CephFS to work.

  • Kernel-based NFS needs to be disabled on minions with the role-ganesha role (see the example below).
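For example, assuming the kernel NFS server is managed by the nfs-server.service systemd unit, which is the default on SUSE Linux Enterprise Server, it can be stopped and disabled as follows:

root # systemctl disable --now nfs-server.service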

11.2 Example Installation

This procedure provides an example installation that uses both the Object Gateway and CephFS File System Abstraction Layers (FSAL) of NFS Ganesha.

  1. If you have not done so, execute DeepSea stages 0 and 1 before continuing with this procedure.

    root@master # salt-run state.orch ceph.stage.0
    root@master # salt-run state.orch ceph.stage.1
  2. After having executed stage 1 of DeepSea, edit the /srv/pillar/ceph/proposals/policy.cfg and add the line

    role-ganesha/cluster/NODENAME

    Replace NODENAME with the name of a node in your cluster.

    Also make sure that a role-mds and a role-rgw are assigned.

  3. Execute at least stages 2 and 4 of DeepSea. Running stage 3 in between is recommended.

    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3 # optional but recommended
    root@master # salt-run state.orch ceph.stage.4
  4. Verify that NFS Ganesha is working by checking that the NFS Ganesha service is running on the minion node:

    root # salt -I roles:ganesha service.status nfs-ganesha
    MINION_ID:
        True

11.3 High Availability Active-Passive Configuration

This section provides an example of how to set up a two-node active-passive configuration of NFS Ganesha servers. The setup requires the SUSE Linux Enterprise High Availability Extension. The two nodes are called earth and mars.

Important: Co-location of Services

Services that have their own fault tolerance and their own load balancing should not be running on cluster nodes that get fenced for failover services. Therefore, do not run Ceph Monitor, Metadata Server, iSCSI, or Ceph OSD services on High Availability setups.

For details about SUSE Linux Enterprise High Availability Extension, see https://www.suse.com/documentation/sle-ha-15/.

11.3.1 Basic Installation

In this setup earth has the IP address 192.168.1.1 and mars has the address 192.168.1.2.

Additionally, two floating virtual IP addresses are used, allowing clients to connect to the service independent of which physical node it is running on. 192.168.1.10 is used for cluster administration with Hawk2 and 192.168.2.1 is used exclusively for the NFS exports. This makes it easier to apply security restrictions later.

The following procedure describes the example installation. More details can be found at https://www.suse.com/documentation/sle-ha-15/book_sleha_quickstarts/data/art_sleha_install_quick.html.

  1. Prepare the NFS Ganesha nodes on the Salt master:

    1. Run DeepSea stages 0 and 1.

      root@master # salt-run state.orch ceph.stage.0
      root@master # salt-run state.orch ceph.stage.1
    2. Assign the nodes earth and mars the role-ganesha in the /srv/pillar/ceph/proposals/policy.cfg:

      role-ganesha/cluster/earth*.sls
      role-ganesha/cluster/mars*.sls
    3. Run DeepSea stages 2 to 4.

      root@master # salt-run state.orch ceph.stage.2
      root@master # salt-run state.orch ceph.stage.3
      root@master # salt-run state.orch ceph.stage.4
  2. Register the SUSE Linux Enterprise High Availability Extension on earth and mars.

    root # SUSEConnect -r ACTIVATION_CODE -e E_MAIL
  3. Install ha-cluster-bootstrap on both nodes:

    root # zypper in ha-cluster-bootstrap
    1. Initialize the cluster on earth:

      root@earth # ha-cluster-init
    2. Let mars join the cluster:

      root@mars # ha-cluster-join -c earth
  4. Check the status of the cluster. You should see two nodes added to the cluster:

    root@earth # crm status
  5. On both nodes, disable the automatic start of the NFS Ganesha service at boot time:

    root # systemctl disable nfs-ganesha
  6. Start the crm shell on earth:

    root@earth # crm configure

    The next commands are executed in the crm shell.

  7. On earth, execute the following commands in the crm shell to configure a resource for the NFS Ganesha daemons as a clone of the systemd resource type:

    crm(live)configure# primitive nfs-ganesha-server systemd:nfs-ganesha \
    op monitor interval=30s
    crm(live)configure# clone nfs-ganesha-clone nfs-ganesha-server meta interleave=true
    crm(live)configure# commit
    crm(live)configure# status
        2 nodes configured
        2 resources configured
    
        Online: [ earth mars ]
    
        Full list of resources:
             Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
             Started:  [ earth mars ]
  8. Create a primitive IPAddr2 with the crm shell:

    crm(live)configure# primitive ganesha-ip IPaddr2 \
    params ip=192.168.2.1 cidr_netmask=24 nic=eth0 \
    op monitor interval=10 timeout=20
    
    crm(live)# status
    Online: [ earth mars  ]
    Full list of resources:
     Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
         Started: [ earth mars ]
     ganesha-ip    (ocf::heartbeat:IPaddr2):    Started earth
  9. To set up a relationship between the NFS Ganesha server and the floating virtual IP, we use colocation and ordering constraints.

    crm(live)configure# colocation ganesha-ip-with-nfs-ganesha-server inf: ganesha-ip nfs-ganesha-clone
    crm(live)configure# order ganesha-ip-after-nfs-ganesha-server Mandatory: nfs-ganesha-clone ganesha-ip
  10. Use the mount command from the client to ensure that cluster setup is complete:

    root # mount -t nfs -v -o sync,nfsvers=4 192.168.2.1:/ /mnt

11.3.2 Clean Up Resources

In the event of an NFS Ganesha failure at one of the nodes, for example earth, fix the issue and clean up the resource. Only after the resource is cleaned up can the service fail back to earth in case NFS Ganesha fails at mars.

To clean up the resource:

root@earth # crm resource cleanup nfs-ganesha-clone earth
root@earth # crm resource cleanup ganesha-ip earth

11.3.3 Setting Up Ping Resource

It may happen that the server is unable to reach the client because of a network issue. A ping resource can detect and mitigate this problem. Configuring this resource is optional.

  1. Define the ping resource:

    crm(live)configure# primitive ganesha-ping ocf:pacemaker:ping \
            params name=ping dampen=3s multiplier=100 host_list="CLIENT1 CLIENT2" \
            op monitor interval=60 timeout=60 \
            op start interval=0 timeout=60 \
            op stop interval=0 timeout=60

    host_list is a list of IP addresses separated by space characters. The IP addresses will be pinged regularly to check for network outages. If a client must always have access to the NFS server, add it to host_list.

  2. Create a clone:

    crm(live)configure# clone ganesha-ping-clone ganesha-ping \
            meta interleave=true
  3. The following command creates a constraint for the NFS Ganesha service. It forces the service to move to another node when host_list is unreachable.

    crm(live)configure# location nfs-ganesha-server-with-ganesha-ping
            nfs-ganesha-clone \
            rule -inf: not_defined ping or ping lte 0

11.3.4 NFS Ganesha HA and DeepSea

DeepSea does not support configuring NFS Ganesha HA. To prevent DeepSea from failing after NFS Ganesha HA was configured, exclude starting and stopping the NFS Ganesha service from DeepSea Stage 4:

  1. Copy /srv/salt/ceph/ganesha/default.sls to /srv/salt/ceph/ganesha/ha.sls.

  2. Remove the .service entry from /srv/salt/ceph/ganesha/ha.sls so that it looks as follows:

    include:
    - .keyring
    - .install
    - .configure
  3. Add the following line to /srv/pillar/ceph/stack/global.yml:

    ganesha_init: ha

11.4 Active-Active Configuration

This section provides an example of a simple active-active NFS Ganesha setup. The aim is to deploy two NFS Ganesha servers layered on top of the same existing CephFS. The servers will be two Ceph cluster nodes with separate addresses. The clients need to be distributed between them manually. Failover in this configuration means manually unmounting the export from the failed server and remounting it from the other server on the client.

11.4.1 Prerequisites

For our example configuration, you need the following:

  • A running Ceph cluster. See Section 4.3, “Cluster Deployment” for details on deploying and configuring a Ceph cluster by using DeepSea.

  • At least one configured CephFS. See Chapter 10, Installation of CephFS for more details on deploying and configuring CephFS.

  • Two Ceph cluster nodes with NFS Ganesha deployed. See Chapter 11, Installation of NFS Ganesha for more details on deploying NFS Ganesha.

    Tip: Use Dedicated Servers

    Although NFS Ganesha nodes can share resources with other Ceph-related services, we recommend using dedicated servers to improve performance.

After you deploy the NFS Ganesha nodes, verify that the cluster is operational and the default CephFS pools are there:

cephadm > rados lspools
cephfs_data
cephfs_metadata

11.4.2 Configure NFS Ganesha

Check that both NFS Ganesha nodes have the file /etc/ganesha/ganesha.conf installed. The nfs-ganesha-ceph package ships with a sample /etc/ganesha/ceph.conf file that you can tweak as needed. The following is an example ceph.conf:

NFS_CORE_PARAM
{
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}
NFSv4
{
    RecoveryBackend = rados_cluster;
    Minor_Versions = 1,2;
}
CACHEINODE {
    Dir_Chunk = 0;
    NParts = 1;
    Cache_Size = 1;
}
EXPORT
{
    Export_ID=100;
    Protocols = 4;
    Transports = TCP;
    Path = /;
    Pseudo = /ceph/;
    Access_Type = RW;
    Attr_Expiration_Time = 0;
    FSAL {
        Name = CEPH;
    }
}
RADOS_KV
{
    pool = "cephfs_metadata";
    namespace = "ganesha";
    #nodeid = "a";
}

Because legacy versions of NFS prevent us from lifting the grace period early and therefore prolong a server restart, we disable options for NFS prior to version 4.2. We also disable most of the NFS Ganesha caching, as the Ceph libraries already perform aggressive caching.

The 'rados_cluster' recovery back-end stores its info in RADOS objects. Although it is not a lot of data, we want it highly available. We use the CephFS metadata pool for this purpose, and declare a new 'ganesha' namespace in it to keep it distinct from CephFS objects.

Note: Cluster Node IDs

Most of the configuration is identical between the two hosts; however, the nodeid option in the 'RADOS_KV' block needs to be a unique string for each node. By default, NFS Ganesha sets nodeid to the host name of the node.

If you need to use fixed values other than host names, you can, for example, set nodeid = 'a' on one node and nodeid = 'b' on the other.
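As a sketch, the 'RADOS_KV' block from the example above would then differ between the two hosts only in the nodeid line:

RADOS_KV
{
    pool = "cephfs_metadata";
    namespace = "ganesha";
    nodeid = "a";
}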

11.4.3 Populate the Cluster Grace Database

We need to verify that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. NFS Ganesha uses this object to communicate the current state with regard to a grace period.

The nfs-ganesha-rados-grace package contains a command line tool for querying and manipulating this database. If the package is not installed on at least one of the nodes, install it with

root # zypper install nfs-ganesha-rados-grace

We will use the command to create the database and add both nodeids. In our example, the two NFS Ganesha nodes are named ses6min1.example.com and ses6min2.example.com. On one of the NFS Ganesha hosts, run:

cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min1.example.com
cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min2.example.com
cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=1 rec=0
======================================================
ses6min1.example.com     E
ses6min2.example.com     E

This creates the grace database and adds both 'ses6min1.example.com' and 'ses6min2.example.com' to it. The last command dumps the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the 'E' flag set. The 'cur' and 'rec' values show the current and recovery epochs, which is how we keep track of what hosts are allowed to perform recovery and when.

11.4.4 Restart NFS Ganesha Services

On both NFS Ganesha nodes, restart the related services:

root # systemctl restart nfs-ganesha.service

After the services are restarted, check the grace database:

cephadm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=3 rec=0
======================================================
ses6min1.example.com
ses6min2.example.com
Note: Cleared the 'E' Flag

Note that both nodes have cleared their 'E' flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.

11.4.5 Conclusion

After you complete all the preceding steps, you can mount the exported NFS from either of the two NFS Ganesha servers, and perform normal NFS operations against them.

Our example configuration assumes that if one of the two NFS Ganesha servers goes down, you will restart it manually within 5 minutes. After 5 minutes, the Metadata Server may cancel the session that the NFS Ganesha client held and all of the state associated with it. If the session’s capabilities get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.

11.5 More Information

More information can be found in Book “Administration Guide”, Chapter 19 “NFS Ganesha: Export Ceph Data via NFS”.

Part IV Cluster Deployment on top of SUSE CaaS Platform 4 (Technology Preview)

12 SUSE Enterprise Storage 6 on top of SUSE CaaS Platform 4 Kubernetes Cluster

Warning: Technology Preview

Running a containerized Ceph cluster on SUSE CaaS Platform is a technology preview. Do not deploy on a production Kubernetes cluster.

This chapter describes how to deploy containerized SUSE Enterprise Storage 6 on top of a SUSE CaaS Platform 4 Kubernetes cluster.

12.1 Considerations

Before you start deploying, consider the following points:

  • To run Ceph in Kubernetes, SUSE Enterprise Storage 6 uses an upstream project called Rook (https://rook.io/).

  • Depending on the configuration, Rook may consume all unused disks on all nodes in a Kubernetes cluster.

  • The setup requires privileged containers.

12.2 Prerequisites

Before you start deploying, you need to have:

  • A running SUSE CaaS Platform 4 cluster.

  • SUSE CaaS Platform 4 worker nodes with a number of extra disks attached as storage for the Ceph cluster.

12.3 Get Rook Manifests

The Rook orchestrator uses configuration files in YAML format called manifests. The manifests you need are included in the rook-k8s-yaml RPM package. Install it by running

root # zypper install rook-k8s-yaml

12.4 Installation

Rook-Ceph includes two main components: the 'operator', which is run by Kubernetes and allows the creation of Ceph clusters, and the Ceph 'cluster' itself, which is created and partially managed by the operator.

12.4.1 Configuration

12.4.1.1 Global Configuration

The manifests used in this setup install all Rook and Ceph components in the 'rook-ceph' namespace. If you need to change it, adapt all references to the namespace in the Kubernetes manifests accordingly.

Depending on which features of Rook you intend to use, alter the 'Pod Security Policy' configuration in common.yaml to limit Rook's security requirements. Follow the comments in the manifest file.

12.4.1.2 Operator Configuration

The manifest operator.yaml configures the Rook operator. Normally, you do not need to change it. Find more information following the comments in the manifest file.

12.4.1.3 Ceph Cluster Configuration

The manifest cluster.yaml is responsible for configuring the actual Ceph cluster which will run in Kubernetes. Find detailed description of all available options in the upstream Rook documentation at https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html.

By default, Rook is configured to use all nodes which are not tainted with node-role.kubernetes.io/master:NoSchedule and will obey configured placement settings (see https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html#placement-configuration-settings). The following example disables such behavior and only uses the nodes explicitly listed in the nodes section:

storage:
  useAllNodes: false
  nodes:
    - name: caasp4-worker-0
    - name: caasp4-worker-1
    - name: caasp4-worker-2
Note

By default, Rook is configured to use all free and empty disks on each node as Ceph storage.
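If Rook should not consume every empty disk, the storage section can also restrict the devices used per node. This is a sketch only; the device names are examples and the full set of options is described in the cluster CRD documentation referenced above:

storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: caasp4-worker-0
      devices:
        - name: "sdb"
        - name: "sdc"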

12.4.1.4 Documentation

12.4.2 Create the Rook Operator

Install the Rook-Ceph common components, CSI roles, and the Rook-Ceph operator by executing the following command on the SUSE CaaS Platform master node:

root # kubectl apply -f common.yaml -f operator.yaml

common.yaml will create the 'rook-ceph' namespace, Ceph Custom Resource Definitions (CRDs) (see https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) to make Kubernetes aware of Ceph Objects (for example 'CephCluster'), and the RBAC roles and Pod Security Policies (see https://kubernetes.io/docs/concepts/policy/pod-security-policy/) which are necessary for allowing Rook to manage the cluster-specific resources.

Tip: hostNetwork and hostPorts Usage

Allowing the usage of hostNetwork is required when using hostNetwork: true in the Cluster Resource Definition. Allowing the usage of hostPorts in the PodSecurityPolicy is also required.

Verify the installation by running kubectl get pods -n rook-ceph on the SUSE CaaS Platform master node, for example:

root # kubectl get pods -n rook-ceph
NAME                                     READY   STATUS      RESTARTS   AGE
rook-ceph-agent-57c9j                    1/1     Running     0          22h
rook-ceph-agent-b9j4x                    1/1     Running     0          22h
rook-ceph-operator-cf6fb96-lhbj7         1/1     Running     0          22h
rook-discover-mb8gv                      1/1     Running     0          22h
rook-discover-tztz4                      1/1     Running     0          22h

12.4.3 Create the Ceph Cluster

After you modify cluster.yaml to your needs, you can create the Ceph cluster. Run the following command on the SUSE CaaS Platform master node:

root # kubectl apply -f cluster.yaml

Watch the 'rook-ceph' namespace to see the Ceph cluster being created. You will see as many Ceph Monitors as configured in the cluster.yaml manifest (the default is 3), one Ceph Manager, and as many Ceph OSDs as you have free disks.
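You can also query the state of the cluster Custom Resource itself, for example (the output columns vary between Rook versions and are therefore not shown here):

root # kubectl get cephcluster --namespace rook-ceph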

Tip: Temporary OSD Pods

While bootstrapping the Ceph cluster, you will see some pods with the name rook-ceph-osd-prepare-NODE-NAME run for a while and then terminate with the status 'Completed'. As their name implies, these pods provision Ceph OSDs. They are left without being deleted so that you can inspect their logs after their termination. For example:

root # kubectl get pods --namespace rook-ceph
NAME                                         READY  STATUS     RESTARTS  AGE
rook-ceph-agent-57c9j                        1/1    Running    0         22h
rook-ceph-agent-b9j4x                        1/1    Running    0         22h
rook-ceph-mgr-a-6d48564b84-k7dft             1/1    Running    0         22h
rook-ceph-mon-a-cc44b479-5qvdb               1/1    Running    0         22h
rook-ceph-mon-b-6c6565ff48-gm9wz             1/1    Running    0         22h
rook-ceph-operator-cf6fb96-lhbj7             1/1    Running    0         22h
rook-ceph-osd-0-57bf997cbd-4wspg             1/1    Running    0         22h
rook-ceph-osd-1-54cf468bf8-z8jhp             1/1    Running    0         22h
rook-ceph-osd-prepare-caasp4-worker-0-f2tmw  0/2    Completed  0         9m35s
rook-ceph-osd-prepare-caasp4-worker-1-qsfhz  0/2    Completed  0         9m33s
rook-ceph-tools-76c7d559b6-64rkw             1/1    Running    0         22h
rook-discover-mb8gv                          1/1    Running    0         22h
rook-discover-tztz4                          1/1    Running    0         22h

12.5 Using Rook as Storage for Kubernetes Workload

Rook allows you to use three different types of storage:

Object Storage

Object storage exposes an S3 API to the storage cluster for applications to put and get data. Refer to https://rook.io/docs/rook/v1.0/ceph-object.html for a detailed description.

Shared File System

A shared file system can be mounted with read/write permission from multiple pods. This is useful for applications that are clustered using a shared file system. Refer to https://rook.io/docs/rook/v1.0/ceph-filesystem.html for a detailed description.

Block Storage

Block storage allows you to mount storage to a single pod. Refer to https://rook.io/docs/rook/v1.0/ceph-block.html for a detailed description.

12.6 Uninstalling Rook

To uninstall Rook, follow these steps:

  1. Delete any Kubernetes applications which are consuming Rook storage.

  2. Delete all object, file, and/or block storage artifacts that you created by following Section 12.5, “Using Rook as Storage for Kubernetes Workload”.

  3. Delete the Ceph cluster, operator, and related resources:

    root # kubectl delete -f cluster.yaml
    root # kubectl delete -f operator.yaml
    root # kubectl delete -f common.yaml
  4. Delete the data on hosts:

    root # rm -rf /var/lib/rook
  5. If necessary, wipe the disks that were used by Rook. Refer to https://rook.io/docs/rook/master/ceph-teardown.html for more details.

A Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases

Several key packages in SUSE Enterprise Storage 6 are based on the Nautilus release series of Ceph. When the Ceph project (https://github.com/ceph/ceph) publishes new point releases in the Nautilus series, SUSE Enterprise Storage 6 is updated to ensure that the product benefits from the latest upstream bugfixes and feature backports.

This chapter contains summaries of notable changes contained in each upstream point release that has been—or is planned to be—included in the product.

Nautilus 14.2.4 Point Release

This point release fixes a serious regression that found its way into the 14.2.3 point release. This regression did not affect SUSE Enterprise Storage customers because we did not ship a version based on 14.2.3.

Nautilus 14.2.3 Point Release

  • Fixed a denial of service vulnerability where an unauthenticated client of Ceph Object Gateway could trigger a crash from an uncaught exception.

  • Nautilus-based librbd clients can now open images on Jewel clusters.

  • The Object Gateway option num_rados_handles has been removed. If you were using a value of num_rados_handles greater than 1, multiply your current objecter_inflight_ops and objecter_inflight_op_bytes parameters by the old num_rados_handles to get the same throttle behavior.

  • The secure mode of Messenger v2 protocol is no longer experimental with this release. This mode is now the preferred mode of connection for monitors.

  • osd_deep_scrub_large_omap_object_key_threshold has been lowered to detect an object with a large number of omap keys more easily.

  • The Ceph Dashboard now supports silencing Prometheus notifications.

Nautilus 14.2.2 Point Release

  • The no{up,down,in,out} related commands have been revamped. There are now two ways to set the no{up,down,in,out} flags: the old command

    ceph osd [un]set FLAG

    which sets cluster-wide flags; and the new command

    ceph osd [un]set-group FLAGS WHO

    which sets flags in batch at the granularity of any crush node, or device class.
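    For example, to set the noout and noin flags on two specific OSDs and clear them again later (the OSD IDs are placeholders), run:

    ceph osd set-group noout,noin osd.0 osd.1
    ceph osd unset-group noout,noin osd.0 osd.1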

  • radosgw-admin introduces two subcommands that allow the managing of expire-stale objects that might be left behind after a bucket reshard in earlier versions of Object Gateway. One subcommand lists such objects and the other deletes them.

  • Earlier Nautilus releases (14.2.1 and 14.2.0) have an issue where deploying a single new Nautilus BlueStore OSD on an upgraded cluster (that is, one that was originally deployed pre-Nautilus) breaks the pool utilization statistics reported by ceph df. Until all OSDs have been reprovisioned or updated (via ceph-bluestore-tool repair), the pool statistics will show values that are lower than the true value. This is resolved in 14.2.2, such that the cluster only switches to using the more accurate per-pool stats after all OSDs are 14.2.2 or later, are BlueStore, and have been updated via the repair function if they were created prior to Nautilus.

  • The default value for mon_crush_min_required_version has been changed from firefly to hammer, which means the cluster will issue a health warning if your CRUSH tunables are older than Hammer. There is generally a small (but non-zero) amount of data that will move around by making the switch to Hammer tunables.

    If possible, we recommend that you set the oldest allowed client to hammer or later. To display what the current oldest allowed client is, run:

    cephadm > ceph osd dump | grep min_compat_client

    If the current value is older than hammer, run the following command to determine whether it is safe to make this change by verifying that there are no clients older than Hammer currently connected to the cluster:

    cephadm > ceph features

    The newer straw2 CRUSH bucket type was introduced in Hammer. If you verify that all clients are Hammer or newer, new features supported only for straw2 buckets can be used, including the crush-compat mode for the Balancer (Book “Administration Guide”, Chapter 8 “Ceph Manager Modules”, Section 8.1 “Balancer”).

Find detailed information about the patch at https://download.suse.com/Download?buildid=D38A7mekBz4~

Nautilus 14.2.1 Point Release

This was the first point release following the original Nautilus release (14.2.0). The original ('General Availability' or 'GA') version of SUSE Enterprise Storage 6 was based on this point release.

Glossary

General

Admin node

The node from which you run the ceph-deploy utility to deploy Ceph on OSD nodes.

Bucket

A point which aggregates other nodes into a hierarchy of physical locations.

Important: Do Not Mix with S3 Buckets

S3 buckets or containers are a different term; they refer to folders for storing objects.

CRUSH, CRUSH Map

Controlled Replication Under Scalable Hashing: An algorithm that determines how to store and retrieve data by computing data storage locations. CRUSH requires a map of the cluster to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

Monitor node, MON

A cluster node that maintains maps of cluster state, including the monitor map and the OSD map.

Node

Any single machine or server in a Ceph cluster.

OSD

Depending on context, Object Storage Device or Object Storage Daemon. The ceph-osd daemon is the component of Ceph that is responsible for storing objects on a local file system and providing access to them over the network.

OSD node

A cluster node that stores data, handles data replication, recovery, backfilling, rebalancing, and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.

PG

Placement Group: a sub-division of a pool, used for performance tuning.

Pool

Logical partitions for storing objects such as disk images.

Rule Set

Rules to determine data placement for a pool.

Ceph Specific Terms

Ceph Storage Cluster

The core set of storage software which stores the user’s data. Such a set consists of Ceph monitors and OSDs.

Also known as the Ceph Object Store.

Object Gateway Specific Terms

archive sync module

Module that enables creating an Object Gateway zone for keeping the history of S3 object versions.

Object Gateway

The S3/Swift gateway component for Ceph Object Store.

B Documentation Updates

This chapter lists content changes for this document since the release of the latest maintenance update of SUSE Enterprise Storage 5. You can find changes related to the cluster deployment that apply to previous versions in https://www.suse.com/documentation/suse-enterprise-storage-5/book_storage_deployment/data/ap_deploy_docupdate.html.

The document was updated on the following dates:

B.1 2019 (Maintenance update of SUSE Enterprise Storage 6 documentation)

General Updates
Bugfixes

B.2 June, 2019 (Release of SUSE Enterprise Storage 6)

General Updates
Bugfixes