7 Ceph cluster custom resource definitions #
7.1 Ceph cluster CRD #
Rook allows the creation and customization of storage clusters through Custom Resource Definitions (CRDs). There are two different methods of cluster creation, depending on whether the storage on which to base the Ceph cluster can be dynamically provisioned.
- Specify the host paths and raw devices. 
- Specify the storage class Rook should use to consume storage via PVCs. 
Examples for each of these approaches follow.
7.1.1 Host-based cluster #
To get you started, here is a simple example of a CRD to configure a Ceph cluster with all nodes and all devices. In the next example, the MONs and OSDs are backed by PVCs.
     In addition to your CephCluster object, you need to create the namespace,
     service accounts, and RBAC rules for the namespace in which you will
     create the CephCluster. These resources are defined in the example
     common.yaml file.
    
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # see the "Cluster Settings" section below for more details on which image of Ceph to run
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: true
  storage:
    useAllNodes: true
    useAllDevices: true
7.1.2 PVC-based cluster #
Kubernetes version 1.13.0 or greater is required to provision OSDs on PVCs.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # see the "Cluster Settings" section below for more details on which image of Ceph to run
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage
        resources:
          requests:
            storage: 10Gi
  storage:
   storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      encrypted: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
          storageClassName: local-storage
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
For more advanced scenarios, such as adding a dedicated device, refer to Section 7.1.4.8, “Dedicated metadata and WAL device for OSD on PVC”.
7.1.3 Settings #
Settings can be specified at the global level to apply to the cluster as a whole, while other settings can be specified at more fine-grained levels. If any setting is unspecified, a suitable default will be used automatically.
7.1.3.1 Cluster metadata #
- name: The name that will be used internally for the Ceph cluster. Most commonly, the name is the same as the namespace since multiple clusters are not supported in the same namespace.
- namespace: The Kubernetes namespace that will be created for the Rook cluster. The services, pods, and other resources created by the operator will be added to this namespace. The common scenario is to create a single Rook cluster. If multiple clusters are created, they must not have conflicting devices or host paths.
7.1.3.2 Cluster settings #
- external:
  - enable: if true, the cluster will not be managed by Rook but by an external entity. This mode is intended to connect to an existing cluster, in which case Rook will only consume the external cluster. However, if an image is provided, Rook will be able to deploy various daemons in Kubernetes, such as object gateways, MDS and NFS. If this setting is enabled, all the other options will be ignored except cephVersion.image and dataDirHostPath. See Section 7.1.4.9, “External cluster”. If cephVersion.image is left blank, Rook will refuse the creation of extra CRs such as object, file and NFS.
 
- cephVersion: The version information for launching the Ceph daemons.
  - image: The image used for running the Ceph daemons. For example, ceph/ceph:v16.2.7 or ceph/ceph:v15.2.4. To ensure a consistent version of the image is running across all nodes in the cluster, we recommend using a very specific image version. Tags also exist that would give the latest version, but they are only recommended for test environments. Using the v14 or similar tag is not recommended in production because it may lead to inconsistent versions of the image running across different nodes in the cluster.
 
- dataDirHostPath: The path on the host where config and data should be stored for each of the services. If the directory does not exist, it will be created. Because this directory persists on the host, it will remain after pods are deleted. You must not use the following paths or any of their subpaths: /etc/ceph, /rook or /var/log/ceph.
  - On Minikube environments, use /data/rook. Minikube boots into a tmpfs but it provides some directories where files can persist across reboots. Using one of these directories will ensure that Rook’s data and configuration files persist and that enough storage space is available.
  - WARNING: For test scenarios, if you delete a cluster and start a new cluster on the same hosts, the path used by dataDirHostPath must be deleted. Otherwise, stale keys and other configuration will remain from the previous cluster and the new MONs will fail to start. If this value is empty, each pod will get an ephemeral directory to store its config files that is tied to the lifetime of the pod running on that node.
 
- continueUpgradeAfterChecksEvenIfNotHealthy: if set to true, Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even if the file system is not healthy.
- dashboard: Settings for the Ceph Dashboard. To view the dashboard in your browser, see Part I, “Ceph Dashboard”.
  - enabled: Whether to enable the dashboard to view cluster status.
  - urlPrefix: Allows serving the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy).
  - port: Allows changing the default port where the dashboard is served.
  - ssl: Whether to serve the dashboard via SSL; ignored on Ceph versions older than 13.2.2.
 
- monitoring: Settings for monitoring Ceph using Prometheus. To enable monitoring on your cluster, see Chapter 16, “Monitoring and alerting”.
  - enabled: Whether to enable Prometheus-based monitoring for this cluster.
  - rulesNamespace: Namespace to deploy prometheusRule. If empty, the namespace of the cluster will be used. We recommend:
    - If you have a single Rook Ceph cluster, set rulesNamespace to the same namespace as the cluster, or leave it empty.
    - If you have multiple Rook Ceph clusters in the same Kubernetes cluster, set rulesNamespace to the same namespace for all the clusters (ideally, the namespace where Prometheus is deployed). Otherwise, you will get duplicate alerts with duplicate alert definitions.
 
 
- network: For the network settings for the cluster, refer to Section 7.1.3.5, “Network configuration settings”.
- mon: Contains MON-related options; see Section 7.1.3.3, “MON settings”.
- mgr: Manager top level section.
  - modules: The list of Ceph Manager modules to enable.
- crashCollector: The settings for crash collector daemon(s).
  - disable: if set to true, the crash collector will not run on any node where a Ceph daemon runs.
 
- annotations: Section 7.1.3.10, “Annotations and labels”
- placement: Section 7.1.3.11, “Placement configuration settings”
- resources: Section 7.1.3.12, “Cluster-wide resources configuration settings”
- priorityClassNames: Section 7.1.3.14, “Priority class names configuration settings”
- storage: Storage selection and configuration that will be used across the cluster. Note that these settings can be overridden for specific nodes.
  - useAllNodes: true or false, indicating if all nodes in the cluster should be used for storage according to the cluster level storage selection and configuration values. If individual nodes are specified under the nodes field, then useAllNodes must be set to false.
  - nodes: Names of individual nodes in the cluster that should have their storage included in accordance with either the cluster level configuration specified above or any node-specific overrides. useAllNodes must be set to false to use specific nodes and their configuration. See Section 7.1.3.6, “Node settings”.
  - config: Config settings applied to all OSDs on the node unless overridden by devices.
 
- disruptionManagement: The section for configuring management of daemon disruptions.
  - managePodBudgets: if true, the operator will create and manage PodDisruptionBudgets for OSD, MON, RGW, and MDS daemons. The operator will block eviction of OSDs by default and unblock them safely when drains are detected.
  - osdMaintenanceTimeout: a duration in minutes that determines how long an entire failure domain such as region/zone/host will be held in noout (in addition to the default DOWN/OUT interval) when it is draining. This is only relevant when managePodBudgets is true. The default value is 30 minutes.
  - manageMachineDisruptionBudgets: if true, the operator will create and manage MachineDisruptionBudgets to ensure OSDs are only fenced when the cluster is healthy. Only available on OpenShift.
  - machineDisruptionBudgetNamespace: the namespace in which to watch the MachineDisruptionBudgets.
 
- removeOSDsIfOutAndSafeToRemove: If true, the operator will remove the OSDs that are down and whose data has been restored to other OSDs.
- cleanupPolicy: Section 7.1.4.10, “Cleanup policy”
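The fragment below is a minimal sketch of how several of the cluster settings described above could be combined in a CephCluster spec. The key names are taken from the list above; all values, including the dashboard subpath and port, are illustrative assumptions and not recommendations.

```yaml
# Fragment of a CephCluster spec; values are examples only.
spec:
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  dashboard:
    enabled: true
    urlPrefix: /ceph-dashboard   # hypothetical subpath behind a reverse proxy
    port: 8443                   # example port, not necessarily the default
    ssl: true
  monitoring:
    enabled: true
    rulesNamespace: rook-ceph    # same namespace as the cluster
  crashCollector:
    disable: false
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30    # minutes
  removeOSDsIfOutAndSafeToRemove: false
```

Any key that is left out falls back to its default, so a real spec only needs the settings that differ from the defaults.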
7.1.3.3 MON settings #
- count: Set the number of MONs to be started. This should be an odd number between one and nine. If not specified, the default is set to three, and allowMultiplePerNode is also set to true.
- allowMultiplePerNode: Enable (true) or disable (false) the placement of multiple MONs on one node. Default is false.
- volumeClaimTemplate: A PersistentVolumeSpec used by Rook to create PVCs for monitor storage. This field is optional, and when not provided, HostPath volume mounts are used. The fields currently used from the template are storageClassName and the storage resource request and limit. The default storage size request for new PVCs is 10Gi. Ensure that the associated storage class is configured to use volumeBindingMode: WaitForFirstConsumer. This setting only applies to new monitors that are created when the requested number of monitors increases, or when a monitor fails and is recreated.
If these settings are changed in the CRD, the operator will update the number of MONs during a periodic check of the MON health, which by default is every 45 seconds.
To change the defaults that the operator uses to determine the MON health and whether to failover a MON, refer to the Section 7.1.3.15, “Health settings”. The intervals should be small enough that you have confidence the MONs will maintain quorum, while also being long enough to ignore network blips where MONs are failed over too often.
7.1.3.4 Ceph Manager settings #
You can use the cluster CR to enable or disable any manager module. For example, this can be configured:
mgr:
  modules:
  - name: <name of the module>
    enabled: true
     Some modules need special configuration to be fully functional after
     being enabled. In particular, when pg_autoscaler is enabled, Rook will
     configure all new pools with PG autoscaling by setting:
     osd_pool_default_pg_autoscale_mode = on
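As an illustration, the following fragment of a CephCluster spec enables the pg_autoscaler module mentioned above. It is a minimal sketch in which the rest of the spec is omitted; any other module name would be set the same way.

```yaml
spec:
  mgr:
    modules:
    # pg_autoscaler is used here only as an example module name
    - name: pg_autoscaler
      enabled: true
```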
    
7.1.3.5 Network configuration settings #
If not specified, the default SDN will be used. Configure the network that will be enabled for the cluster and services.
- provider: Specifies the network provider that will be used to connect the network interface.
- selectors: The list of network selectors to use, each associated with a key.
Changing networking configuration after a Ceph cluster has been deployed is not supported and will result in a non-functioning cluster.
     To use host networking, set provider: host.
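A minimal sketch of the host networking case, assuming the rest of the CephCluster spec is defined elsewhere:

```yaml
spec:
  network:
    # use the host network instead of the default SDN
    provider: host
```

Because the networking configuration cannot be changed after deployment, this setting needs to be in place when the cluster is first created.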
    
7.1.3.6 Node settings #
In addition to the cluster level settings specified above, each individual node can also specify configuration to override the cluster level settings and defaults. If a node does not specify any configuration, then it will inherit the cluster level settings.
- name: The name of the node, which should match its kubernetes.io/hostname label.
- config: Configuration settings applied to all OSDs on the node unless overridden by devices.
     When useAllNodes is set to true,
     Rook attempts to make Ceph cluster management as hands-off as possible
     while still maintaining reasonable data safety. If a usable node comes
     online, Rook will begin to use it automatically. To maintain a balance
     between hands-off usability and data safety, nodes are removed from Ceph
     as OSD hosts only (1) if the node is deleted from Kubernetes itself or (2) if
     the node has its taints or affinities modified in such a way that the node
     is no longer usable by Rook. Any changes to taints or affinities,
     intentional or unintentional, may affect the data reliability of the
     Ceph cluster. In order to help protect against this somewhat, deletion
     of nodes by taint or affinity modifications must be confirmed by deleting
     the Rook-Ceph operator pod and allowing the operator deployment to
     restart the pod.
    
     For production clusters, we recommend that useAllNodes
     is set to false to prevent the Ceph cluster from
     suffering reduced data reliability unintentionally due to a user mistake.
     When useAllNodes is set to false,
     Rook relies on the user to be explicit about when nodes are added to or
     removed from the Ceph cluster. Nodes are only added to the Ceph
     cluster if the node is added to the Ceph cluster resource. Similarly,
     nodes are only removed if the node is removed from the Ceph cluster
     resource.
    
7.1.3.6.1 Node updates #
Nodes can be added and removed over time by updating the cluster CRD, for example with the following command:
kubectl -n rook-ceph edit cephcluster rook-ceph
      This will bring up your default text editor and allow you to add and
      remove storage nodes from the cluster. This feature is only available
      when useAllNodes has been set to
      false.
     
7.1.3.7 Storage selection settings #
Below are the settings available, both at the cluster and individual node level, for selecting which storage resources will be included in the cluster.
- useAllDevices: true or false, indicating whether all devices found on nodes in the cluster should be automatically consumed by OSDs. This is not recommended unless you have a very controlled environment where you will not risk formatting devices with existing data. When true, all devices/partitions will be used. Is overridden by deviceFilter if specified.
- deviceFilter: A regular expression for short kernel names of devices (for example, sda) that allows selection of devices to be consumed by OSDs. If individual devices have been specified for a node then this filter will be ignored. For example:
  - sdb: Selects only the sdb device (if found).
  - ^sd: Selects all devices starting with sd.
  - ^sd[a-d]: Selects devices starting with sda, sdb, sdc, and sdd (if found).
  - ^s: Selects all devices that start with s.
  - ^[^r]: Selects all devices that do not start with r.
 
- devicePathFilter: A regular expression for device paths (for example, /dev/disk/by-path/pci-0:1:2:3-scsi-1) that allows selection of devices to be consumed by OSDs. If individual devices or deviceFilter have been specified for a node then this filter will be ignored. For example:
  - ^/dev/sd.: Selects all devices starting with sd.
  - ^/dev/disk/by-path/pci-.*: Selects all devices which are connected to the PCI bus.
 
- devices: A list of individual device names belonging to this node to include in the storage cluster.
  - name: The name of the device (for example, sda), or a full udev path (such as /dev/disk/by-id/ata-ST4000DM004-XXXX), which will not change after reboots.
  - config: Device-specific configuration settings.
 
- storageClassDeviceSets: Explained in Section 7.1.3.8, “Storage class device sets”.
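The selection settings above can be sketched at the cluster level as follows. This is an illustrative fragment only; the regular expressions are copied from the examples in the list, and the commented-out line shows the path-based alternative.

```yaml
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    # consume only sda, sdb, sdc and sdd (short kernel names) on every node
    deviceFilter: "^sd[a-d]"
    # alternative: match on device paths; ignored when deviceFilter is specified
    # devicePathFilter: "^/dev/disk/by-path/pci-.*"
```

A narrow filter of this kind is one way to limit the risk described for useAllDevices, since only devices whose names match are consumed.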
7.1.3.8 Storage class device sets #
The following are the settings for Storage Class Device Sets which can be configured to create OSDs that are backed by block mode PVs.
- name: A name for the set.
- count: The number of devices in the set.
- resources: The CPU and RAM requests or limits for the devices (optional).
- placement: The placement criteria for the devices (optional; default is no placement criteria).
  The syntax is the same as for Section 7.1.3.11, “Placement configuration settings”. It supports the nodeAffinity, podAffinity, podAntiAffinity and tolerations keys.
  We recommend configuring the placement such that the OSDs will be as evenly spread across nodes as possible. At a minimum, anti-affinity should be added, so at least one OSD will be placed on each available node.
  However, if there are more OSDs than nodes, this anti-affinity will not be effective. Another placement scheme to consider is to add labels to the nodes in such a way that the OSDs can be grouped on those nodes, create multiple storageClassDeviceSets, and add node affinity to each of the device sets that will place the OSDs in those sets of nodes.
- preparePlacement: The placement criteria for the preparation of the OSD devices. Creating OSDs is a two-step process and the prepare job may require different placement than the OSD daemons. If preparePlacement is not specified, the placement will instead be applied for consistent placement for the OSD prepare jobs and OSD deployments. preparePlacement is only useful for portable OSDs in the device sets. OSDs that are not portable will be tied to the host where the OSD prepare job initially runs.
  For example, provisioning may require topology spread constraints across zones, but the OSD daemons may require constraints across hosts within the zones.
 
- portable: If true, the OSDs will be allowed to move between nodes during failover. This requires a storage class that supports portability (for example, aws-ebs, but not the local storage provisioner). If false, the OSDs will be assigned to a node permanently. Rook will configure Ceph’s CRUSH map to support the portability.
- tuneDeviceClass: If true, because the OSD can be on a slow device class, Rook will adapt to that by tuning the OSD process. This will make Ceph perform better under that slow device.
- volumeClaimTemplates: A list of PVC templates to use for provisioning the underlying storage devices.
  - resources.requests.storage: The desired capacity for the underlying storage devices.
  - storageClassName: The StorageClass to provision PVCs from. The default is to use the cluster-default StorageClass. This StorageClass should provide a raw block device, multipath device, or logical volume. Other types are not supported. If you want to use logical volumes, see the known issue of OSD on LV-backed PVC: https://github.com/rook/rook/blob/master/Documentation/ceph-common-issues.md#lvm-metadata-can-be-corrupted-with-osd-on-lv-backed-pvc
  - volumeMode: The volume mode to be set for the PVC.
  - accessModes: The access mode for the PVC to be bound by OSD.
 
- schedulerName: Scheduler name for OSD pod placement (optional).
- encrypted: whether to encrypt all the OSDs in a given storageClassDeviceSet.
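The anti-affinity recommendation for device sets can be sketched as follows. This is a hedged example, not a definitive configuration: the set name, count, storage class and sizes are reused from the PVC-based cluster example, and the app label values rook-ceph-osd and rook-ceph-osd-prepare are assumed to be the labels carried by the OSD and OSD prepare pods in your deployment.

```yaml
spec:
  storage:
    storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      placement:
        podAntiAffinity:
          # prefer not to co-locate OSD pods on the same host
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd          # assumed OSD pod label
                  - rook-ceph-osd-prepare  # assumed prepare job label
              topologyKey: kubernetes.io/hostname
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          storageClassName: local-storage  # change for your environment
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
```

A preferred rule is used in this sketch so that scheduling still succeeds when there are more OSDs than nodes, at the cost of a weaker spreading guarantee.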
7.1.3.9 Storage selection via Ceph DriveGroups #
Ceph DriveGroups allow for specifying highly advanced OSD layouts. Refer to Section 13.4.3, “Adding OSDs using DriveGroups specification”, for both general information and a detailed specification of DriveGroups with useful examples.
      When managing a Rook/Ceph cluster’s OSD layouts with DriveGroups, the
      storage configuration is mostly ignored.
      storageClassDeviceSets can still be used to create
      OSDs on PVC, but Rook will no longer use storage
      configurations for creating OSDs on a node's devices. To avoid confusion,
      we recommend using the storage configuration
      or DriveGroups, but never both.
      Because storage and DriveGroups
      should not be used simultaneously, Rook only supports provisioning OSDs
      with DriveGroups on new Rook-Ceph clusters.
     
DriveGroups are defined by a name, a Ceph DriveGroups spec, and a Rook placement.
- name: A name for the DriveGroups.
- spec: The Ceph DriveGroups spec. Some components of the spec are treated differently in the context of Rook as noted below.
  - Rook overrides Ceph’s definition of placement in order to use Rook’s placement below.
  - Rook overrides Ceph’s service_id field to be the same as the DriveGroups name above.
 
- placement: The placement criteria for nodes to provision with the DriveGroups (optional; default is no placement criteria, which matches all untainted nodes). The syntax is the same as for Section 7.1.3.11, “Placement configuration settings”.
7.1.3.10 Annotations and labels #
Annotations and Labels can be specified so that the Rook components will have those annotations or labels added to them.
You can set annotations and labels for Rook components as key-value pairs under the following keys:
- all: Set annotations / labels for all components
- mgr: Set annotations / labels for MGRs
- mon: Set annotations / labels for MONs
- osd: Set annotations / labels for OSDs
- prepareosd: Set annotations / labels for OSD Prepare Jobs
     When other keys are set, all will be merged together
     with the specific component.
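A minimal sketch of this structure is shown below, assuming the pairs are placed under annotations and labels in the CephCluster spec. All annotation and label names and values are hypothetical and only illustrate how the component keys are used.

```yaml
spec:
  annotations:
    all:
      example.com/owner: storage-team   # hypothetical annotation for all components
    osd:
      example.com/tier: data            # hypothetical; merged with all for OSDs
  labels:
    all:
      environment: test                 # hypothetical label for all components
    mon:
      role: ceph-mon                    # hypothetical; merged with all for MONs
```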
    
7.1.3.11 Placement configuration settings #
     Placement configuration for the cluster services. It includes the
     following keys: mgr, mon,
     osd, cleanup, and
     all. Each service will have its placement configuration
     generated by merging the generic configuration under
     all with the most specific one (which will override any
     attributes).
    
      Placement of OSD pods is controlled using the
      Section 7.1.3.8, “Storage class device sets”, not the general
      placement configuration.
     
A placement configuration is specified (according to the Kubernetes PodSpec) as:
- nodeAffinity
- podAffinity
- podAntiAffinity
- tolerations
- topologySpreadConstraints
     If you use labelSelector for OSD pods, you must write
     two rules both for rook-ceph-osd and
     rook-ceph-osd-prepare.
    
     The Rook Ceph operator creates a job called
     rook-ceph-detect-version to detect the full Ceph
     version used by the given cephVersion.image. The
     placement from the MON section is used for the job except for the
     PodAntiAffinity field.
    
7.1.3.12 Cluster-wide resources configuration settings #
Resources should be specified so that the Rook components are handled according to Kubernetes Pod Quality of Service classes. This helps keep Rook components running when, for example, a node runs out of memory, because whether the Rook components are killed depends on their Quality of Service class.
You can set resource requests/limits for Rook components through the Section 7.1.3.13, “Resource requirements and limits” structure in the following keys:
- mgr: Set resource requests/limits for MGRs.
- mon: Set resource requests/limits for MONs.
- osd: Set resource requests/limits for OSDs.
- prepareosd: Set resource requests/limits for OSD prepare job.
- crashcollector: Set resource requests and limits for the crash collector. This pod runs wherever there is a Ceph pod running. It scrapes for Ceph daemon core dumps and sends them to the Ceph manager crash module so that core dumps are centralized and can be easily listed/accessed.
- cleanup: Set resource requests and limits for cleanup job, responsible for wiping cluster’s data after uninstall.
To provide the best possible experience running Ceph in containers, Rook internally recommends minimum memory limits if resource limits are passed. If a user configures a limit or request value that is too low, Rook will still run the pod(s) and print a warning to the operator log. The recommended minimums are:
- mon: 1024 MB
- mgr: 512 MB
- osd: 2048 MB
- mds: 4096 MB
- prepareosd: 50 MB
- crashcollector: 60 MB
7.1.3.13 Resource requirements and limits #
- requests: Requests for CPU or memory.
  - cpu: Request for CPU (example: one CPU core 1, 50% of one CPU core 500m).
  - memory: Request for memory (example: one gigabyte of memory 1Gi, half a gigabyte of memory 512Mi).
- limits: Limits for CPU or memory.
  - cpu: Limit for CPU (example: one CPU core 1, 50% of one CPU core 500m).
  - memory: Limit for memory (example: one gigabyte of memory 1Gi, half a gigabyte of memory 512Mi).
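The requests and limits structure can be sketched for two of the component keys as follows. The numbers are illustrative assumptions chosen to stay at or above the minimum memory values listed in Section 7.1.3.12; they are not sizing recommendations.

```yaml
spec:
  resources:
    mon:
      requests:
        cpu: "500m"     # 50% of one CPU core
        memory: "1Gi"
      limits:
        cpu: "1"        # one CPU core
        memory: "2Gi"
    osd:
      requests:
        cpu: "1"
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
```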
 
7.1.3.14 Priority class names configuration settings #
Priority class names can be specified so that the Rook components will have those priority class names added to them.
You can set priority class names for Rook components as key-value pairs under the following keys:
- all: Set priority class names for MGRs, MONs, OSDs.
- mgr: Set priority class names for MGRs.
- mon: Set priority class names for MONs.
- osd: Set priority class names for OSDs.
     The specific component keys will act as overrides to
     all.
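A minimal sketch of such a configuration follows. The PriorityClass names are hypothetical; the corresponding PriorityClass objects must already exist in the Kubernetes cluster before pods can reference them.

```yaml
spec:
  priorityClassNames:
    all: rook-ceph-default-priority-class   # hypothetical PriorityClass name
    mon: rook-ceph-mon-priority-class       # hypothetical; overrides all for MONs
    osd: rook-ceph-osd-priority-class       # hypothetical; overrides all for OSDs
```

In this sketch, MGRs use the class given under all, because no mgr key is set.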
    
7.1.3.15 Health settings #
Rook-Ceph will monitor the state of the CephCluster on various components by default. The following CRD settings are available:
- healthCheck: main Ceph cluster health monitoring section
Currently three health checks are implemented:
- mon: health check on the Ceph monitors. Basic check as to whether monitors are members of the quorum. If after a certain timeout a given monitor has not rejoined the quorum, it will be failed over and replaced by a new monitor.
- osd: health check on the Ceph OSDs.
- status: Ceph health status check; periodically checks the Ceph health state, and reflects it in the CephCluster CR status field.
     The liveness probe of each daemon can also be controlled via
     livenessProbe. The setting is valid for
     mon, mgr and osd.
     Here is a complete example for both daemonHealth and
     livenessProbe:
    
healthCheck:
  daemonHealth:
    mon:
      disabled: false
      interval: 45s
      timeout: 600s
    osd:
      disabled: false
      interval: 60s
    status:
      disabled: false
  livenessProbe:
    mon:
      disabled: false
    mgr:
      disabled: false
    osd:
      disabled: false
     You can change the mgr probe by applying the following:
    
healthCheck:
  livenessProbe:
    mgr:
      disabled: false
      probe:
        httpGet:
          path: /
          port: 9283
        initialDelaySeconds: 3
        periodSeconds: 3
Changing the liveness probe is an advanced operation and should rarely be necessary. If you want to change these settings, start with the probe specification that Rook generates by default and then modify the desired settings.
7.1.4 Samples #
Here are several samples for configuring Ceph clusters. Each of the samples must also include the namespace and corresponding access granted for management by the Ceph operator. See the common cluster resources below.
7.1.4.1 Storage configuration: All devices #
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: true
  dashboard:
    enabled: true
  # cluster level storage configuration and selection
  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter:
    config:
      metadataDevice:
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
      journalSizeMB: "1024"  # this value can be removed for environments with normal sized disks (20 GB or larger)
      osdsPerDevice: "1"
7.1.4.2 Storage configuration: Specific devices #
Individual nodes and their configurations can be specified so that only the named nodes below will be used as storage resources. Each node’s “name” field should match their “kubernetes.io/hostname” label.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: true
  dashboard:
    enabled: true
  # cluster level storage configuration and selection
  storage:
    useAllNodes: false
    useAllDevices: false
    deviceFilter:
    config:
      metadataDevice:
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
    nodes:
    - name: "172.17.4.201"
      devices:             # specific devices to use for storage can be specified for each node
      - name: "sdb" # Whole storage device
      - name: "sdc1" # One specific partition. Should not have a file system on it.
      - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # both device name and explicit udev links are supported
      config:         # configuration can be specified at the node level which overrides the cluster level config
        storeType: bluestore
    - name: "172.17.4.301"
      deviceFilter: "^sd."
7.1.4.3 Node affinity #
To control where various services will be scheduled by Kubernetes, use the placement configuration sections below. The example under “all” would have all services scheduled on Kubernetes nodes labeled with “role=storage-node” and tolerate taints with a key of “storage-node”.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: true
  # enable the Ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node
      tolerations:
      - key: storage-node
        operator: Exists
    mgr:
      nodeAffinity:
      tolerations:
    mon:
      nodeAffinity:
      tolerations:
    osd:
      nodeAffinity:
      tolerations:
7.1.4.4 Resource requests and limits #
     To control how many resources the Rook components can request/use, you
     can set requests and limits in Kubernetes for them. You can override these
     requests and limits for OSDs per node when using useAllNodes:
     false in the node item in the
     nodes list.
    
Before setting resource requests/limits, review the Ceph documentation for hardware recommendations for each component.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: true
  # enable the Ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
  # cluster level resource requests/limits configuration
  resources:
  storage:
    useAllNodes: false
    nodes:
    - name: "172.17.4.201"
      resources:
        limits:
          cpu: "2"
          memory: "4096Mi"
        requests:
          cpu: "2"
          memory: "4096Mi"
7.1.4.5 OSD topology #
The topology of the cluster is important in production environments where you want your data spread across failure domains. The topology can be controlled by adding labels to the nodes. When the labels are found on a node at first OSD deployment, Rook will add them to the desired level in the CRUSH map.
The complete list of labels in hierarchy order from highest to lowest is:
topology.kubernetes.io/region
topology.kubernetes.io/zone
topology.rook.io/datacenter
topology.rook.io/room
topology.rook.io/pod
topology.rook.io/pdu
topology.rook.io/row
topology.rook.io/rack
topology.rook.io/chassis
For example, if the following labels were added to a node:
kubectl label node mynode topology.kubernetes.io/zone=zone1
kubectl label node mynode topology.rook.io/rack=rack1
For Kubernetes versions earlier than 1.17, use the topology key failure-domain.beta.kubernetes.io/zone or failure-domain.beta.kubernetes.io/region.
     
These labels would result in the following hierarchy for OSDs on that node (this command can be run in the Rook toolbox):
[root@mynode /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                 STATUS REWEIGHT PRI-AFF
-1       0.01358 root default
-5       0.01358     zone zone1
-4       0.01358         rack rack1
-3       0.01358             host mynode
 0   hdd 0.00679                 osd.0         up  1.00000 1.00000
 1   hdd 0.00679                 osd.1         up  1.00000 1.00000
Ceph requires unique names at every level in the hierarchy (CRUSH map). For example, you cannot have two racks with the same name that are in different zones. Racks in different zones must be named uniquely.
     Note that the host is added automatically to the
     hierarchy by Rook. The host cannot be specified with a topology label.
     All topology labels are optional.
    
When the node labels are set before the CephCluster is created, these settings take effect immediately. However, applying them to an already deployed CephCluster requires removing each node from the cluster and then re-adding it with the new configuration. Do this node by node to keep your data safe! Check the result with ceph osd tree from the toolbox (see Chapter 9, Toolboxes). The OSD tree should display the hierarchy for the nodes that have already been re-added.
     
     To utilize the failureDomain based on the node labels,
     specify the corresponding option in the CephBlockPool.
    
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: rack  # this matches the topology labels on nodes
  replicated:
    size: 3
This configuration will split the replication of volumes across unique racks in the data center setup.
7.1.4.6 Using PVC storage for monitors #
In the CRD specification below, three monitors are created, each using a 10Gi PVC created by Rook using the local-storage storage class.
    
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage
        resources:
          requests:
            storage: 10Gi
  dashboard:
    enabled: true
  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter:
    config:
      metadataDevice:
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
      journalSizeMB: "1024"  # this value can be removed for environments with normal sized disks (20 GB or larger)
      osdsPerDevice: "1"
7.1.4.7 Using StorageClassDeviceSets #
In the CRD specification below, Rook creates three OSDs (with specific placement and resource values) and three MONs, each using a 10Gi PVC from the local-storage storage class.
    
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage
        resources:
          requests:
            storage: 10Gi
  cephVersion:
    image: ceph/ceph:v15.2.4
    allowUnsupported: false
  dashboard:
    enabled: true
  network:
    hostNetwork: false
  storage:
    storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      resources:
        limits:
          cpu: "500m"
          memory: "4Gi"
        requests:
          cpu: "500m"
          memory: "4Gi"
      placement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "rook.io/cluster"
                  operator: In
                  values:
                    - cluster1
                topologyKey: "topology.kubernetes.io/zone"
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          storageClassName: local-storage
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
7.1.4.8 Dedicated metadata and WAL device for OSD on PVC #
In the simplest case, Ceph OSD BlueStore consumes a single (primary) storage device. BlueStore is the engine used by the OSD to store data.
The storage device is normally used as a whole, occupying the full device that is managed directly by BlueStore. It is also possible to deploy BlueStore across additional devices such as a DB device. This device can be used for storing BlueStore’s internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. If the DB device fills up, metadata will spill back onto the primary device (where it would have been otherwise). It is only helpful to provision a DB device if it is faster than the primary device.
You can have multiple volumeClaimTemplates, where each represents either a data device or a metadata device. The storage section then looks similar to the following:
    
  storage:
   storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
          storageClassName: gp2
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
      - metadata:
          name: metadata
        spec:
          resources:
            requests:
              # Find the right size https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
              storage: 5Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
          storageClassName: io1
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
Rook only supports three naming conventions for a given template:
- data: represents the main OSD block device, where your data is being stored. 
- metadata: represents the metadata device (including block.db and block.wal) used to store the Ceph BlueStore database for an OSD.
- wal: represents the block.wal device used to store the BlueStore write-ahead log for an OSD. If this device is set, the metadata device refers specifically to the block.db device.
It is recommended to use a faster storage class for the metadata or wal device, with a slower device for the data. Otherwise, having a separate metadata device will not improve the performance.
The BlueStore partition has the following reference combinations supported by the ceph-volume utility:
- A single “data” device.
  storage:
   storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
          storageClassName: gp2
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
- A data device and a metadata device.
  storage:
   storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
          storageClassName: gp2
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
      - metadata:
          name: metadata
        spec:
          resources:
            requests:
              # Find the right size https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
              storage: 5Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
          storageClassName: io1
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
- A data device and a WAL device. A WAL device can be used for BlueStore’s internal journal or write-ahead log (block.wal). It is only useful to use a WAL device if the device is faster than the primary device (the data device). There is no separate metadata device in this case; the data of the main OSD block and block.db are located on the data device.
  storage:
   storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
          storageClassName: gp2
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
      - metadata:
          name: wal
        spec:
          resources:
            requests:
              # Find the right size https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
              storage: 5Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
          storageClassName: io1
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
- A data device, a metadata device and a wal device.
  storage:
   storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      tuneDeviceClass: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
          storageClassName: gp2
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
      - metadata:
          name: metadata
        spec:
          resources:
            requests:
              # Find the right size https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
              storage: 5Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
          storageClassName: io1
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
      - metadata:
          name: wal
        spec:
          resources:
            requests:
              # Find the right size https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
              storage: 5Gi
          # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
          storageClassName: io1
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
With the data and metadata configuration, each OSD has a 10 GB device allocated for its main block as well as a 5 GB device that acts as the BlueStore database.
7.1.4.9 External cluster #
The minimum supported Ceph version for the External Cluster is Luminous 12.2.x.
The features available from the external cluster will vary depending on the version of Ceph. The following table shows the minimum version of Ceph for some of the features:
| FEATURE | CEPH VERSION |
|---|---|
| Dynamic provisioning RBD | 12.2.X |
| Configure extra CRDs (object, file, NFS) [a] | 13.2.3 |
| Dynamic provisioning CephFS | 14.2.3 |

[a] Configure an object store, shared file system, or NFS resources in the local cluster to connect to the external Ceph cluster.
7.1.4.9.1 Prerequisites #
To configure an external Ceph cluster with Rook, you need to inject some information that allows Rook to connect to that cluster. You can use the cluster/examples/kubernetes/ceph/import-external-cluster.sh script to do this. The script looks for the following environment variables:
     
- NAMESPACE: the namespace where the configmap and secrets should be injected.
- ROOK_EXTERNAL_FSID: the FSID of the external Ceph cluster. This can be retrieved via the ceph fsid command.
- ROOK_EXTERNAL_CEPH_MON_DATA: a comma-separated list of running monitors' IP addresses along with their ports. For example, a=172.17.0.4:6789,b=172.17.0.5:6789,c=172.17.0.6:6789. You do not need to specify all the monitors; you can simply pass one, and the operator will discover the rest. The name of the monitor is the name that appears in the ceph status output.
Now, we need to give Rook a key to connect to the cluster in order to perform various operations, such as cluster health checks, CSI keys management, etc. We recommend generating keys with minimal access, so the admin key does not need to be used by the external cluster. In this case, the admin key is only needed to generate the keys that will be used by the external cluster. If the admin key is to be used by the external cluster, however, set the following variable:
- ROOK_EXTERNAL_ADMIN_SECRET: OPTIONAL: the external Ceph cluster admin secret key. This can be retrieved via the ceph auth get-key client.admin command.
       WARNING: If you plan to create CRs
       (pool, rgw, mds, nfs) in the external cluster, you
       MUST inject the client.admin keyring
       as well as injecting cluster-external-management.yaml
      
Example:
export NAMESPACE=rook-ceph-external
export ROOK_EXTERNAL_FSID=3240b4aa-ddbc-42ee-98ba-4ea7b2a61514
export ROOK_EXTERNAL_CEPH_MON_DATA=a=172.17.0.4:6789
export ROOK_EXTERNAL_ADMIN_SECRET=AQC6Ylxdja+NDBAAB7qy9MEAr4VLLq4dCIvxtg==
      If the Ceph admin key is not provided, the following script needs to be
      executed on a machine that can connect to the Ceph cluster using the
      Ceph admin key. On that machine, run
      cluster/examples/kubernetes/ceph/create-external-cluster-resources.sh.
      The script will automatically create users and keys with the lowest
      possible privileges and populate the necessary environment variables for
      cluster/examples/kubernetes/ceph/import-external-cluster.sh
      to work correctly.
     
Finally, execute the script like this from a machine that has access to your Kubernetes cluster:
bash cluster/examples/kubernetes/ceph/import-external-cluster.sh
7.1.4.9.2 CephCluster example (consumer) #
Assuming the above section has successfully completed, here is a CR example:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph-external
  namespace: rook-ceph-external
spec:
  external:
    enable: true
  crashCollector:
    disable: true
  # optionally, the ceph-mgr IP address can be passed to gather metrics from the Prometheus exporter
  #monitoring:
    #enabled: true
    #rulesNamespace: rook-ceph
    #externalMgrEndpoints:
      #- ip: 192.168.39.182
Choose the namespace carefully; if you have an existing cluster managed by Rook, you have likely already injected common.yaml. You also need to inject common-external.yaml.
     
You can now create it like this:
kubectl create -f cluster/examples/kubernetes/ceph/cluster-external.yaml
If the previous section has not been completed, the Rook Operator will still acknowledge the CR creation but will wait forever to receive connection information.
       If no cluster is managed by the current Rook Operator, you need to
       inject common.yaml, then modify
       cluster-external.yaml and specify
       rook-ceph as namespace.
      
      If this is successful, you will see the CephCluster status as 
      connected.
     
kubectl get CephCluster -n rook-ceph-external
NAME                 DATADIRHOSTPATH   MONCOUNT   AGE    STATE       HEALTH
rook-ceph-external   /var/lib/rook                162m   Connected   HEALTH_OK
Before you create a StorageClass with this cluster you will need to create a pool in your external Ceph Cluster.
7.1.4.9.3 Example StorageClass based on external Ceph pool #
In the cluster, list the pools available:
rados df
POOL_NAME     USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS  RD WR_OPS  WR USED COMPR UNDER COMPR
replicated_2g  0 B       0      0      0                  0       0        0      0 0 B      0 0 B        0 B         0 B
      Here is an example StorageClass configuration that uses the
      replicated_2g pool from the external cluster:
     
cat << EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-block-ext
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    # clusterID is the namespace where the rook cluster is running
    clusterID: rook-ceph-external
    # Ceph pool into which the RBD image shall be created
    pool: replicated_2g
    # RBD image format. Defaults to "2".
    imageFormat: "2"
    # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
    imageFeatures: layering
    # The secrets contain Ceph admin credentials.
    csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph-external
    csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph-external
    csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph-external
    # Specify the filesystem type of the volume. If not specified, csi-provisioner
    # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
    # in hyperconverged settings where the volume is mounted on the same node as the osds.
    csi.storage.k8s.io/fstype: ext4
# Delete the rbd volume when a PVC is deleted
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF
You can now create a persistent volume based on this StorageClass.
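A claim against this StorageClass could look like the following sketch. The claim name rbd-pvc-ext and the requested size are hypothetical; only the storageClassName is taken from the example above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc-ext          # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi           # illustrative size
  # the StorageClass backed by the external replicated_2g pool
  storageClassName: rook-ceph-block-ext
```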
7.1.4.9.4 CephCluster example (management) #
The following CephCluster CR represents a cluster that will perform management tasks on the external cluster. It will not only act as a consumer, but will also allow the deployment of other CRDs such as CephFilesystem or CephObjectStore. As mentioned above, you would need to inject the admin keyring for that.
The corresponding YAML example:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph-external
  namespace: rook-ceph-external
spec:
  external:
    enable: true
  dataDirHostPath: /var/lib/rook
  cephVersion:
    image: ceph/ceph:v15.2.4 # Should match external cluster version
7.1.4.10 Cleanup policy #
Rook can clean up resources and data that were deployed when a delete cephcluster command is issued. The policy represents the confirmation that cluster data should be forcibly deleted. The cleanupPolicy should only be added to the cluster when the cluster is about to be deleted. After the confirmation field of the cleanup policy is set, Rook stops configuring the cluster, as if the cluster were about to be destroyed, to prevent these settings from being deployed unintentionally. The cleanupPolicy setting has the following fields:
    
- confirmation: Only an empty string and yes-really-destroy-data are valid values for this field. If an empty string is set, Rook will only remove Ceph’s metadata. A re-installation will not be possible unless the hosts are cleaned first. If yes-really-destroy-data is set, the operator will automatically delete data on the hostpath of cluster nodes and clean devices with OSDs. The cluster can then be re-installed if desired with no further steps.
- sanitizeDisks: Advanced settings that can be used to sanitize drives. This field only takes effect if confirmation is set to yes-really-destroy-data. The administrator might want to sanitize the drives in more depth with the following flags:
  - method: Indicates whether the entire disk should be sanitized or Ceph metadata only. Possible choices are “quick” (default) or “complete”.
  - dataSource: Indicates where to get random bytes from to write on the disk. Possible choices are “zero” (default) or “random”. Using random sources will consume entropy from the system and will take much more time than the zero source.
  - iteration: Overwrite N times instead of the default (1). Takes an integer value.
- allowUninstallWithVolumes: If set to true, the CephCluster deletion does not wait for the PVCs to be deleted. Default is false.
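Put together, the fields described above could appear in the CephCluster spec as in the following sketch. All values shown are the defaults named in the list; the rest of the CephCluster specification is omitted here.

```yaml
spec:
  # other CephCluster settings omitted
  cleanupPolicy:
    # leave empty until the cluster is about to be deleted;
    # the only other valid value is "yes-really-destroy-data"
    confirmation: ""
    sanitizeDisks:
      method: quick       # or "complete"
      dataSource: zero    # or "random"
      iteration: 1
    allowUninstallWithVolumes: false
```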
To automate activation of the cleanup, you can use the following command:
Data will be permanently deleted.
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
 -p '{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}'
Nothing will happen until the deletion of the CR is requested, so this can still be reverted. However, all new configuration by the operator will be blocked with this cleanup policy enabled.
Rook waits for the deletion of PVs provisioned using the CephCluster before proceeding to delete the CephCluster. To force deletion of the CephCluster without waiting for the PVs to be deleted, you can set allowUninstallWithVolumes to true under spec.cleanupPolicy.
    
7.2 Ceph block pool CRD #
Rook allows creation and customization of storage pools through the custom resource definitions (CRDs). The following settings are available for pools.
7.2.1 Samples #
7.2.1.1 Replicated #
For optimal performance, while also adding redundancy, this sample will configure Ceph to make three full copies of the data on multiple nodes.
This sample requires at least one OSD per node, with OSDs located on at least three different nodes.
     Each OSD must be located on a different node, because the
     failureDomain is set to host and the
     replicated.size is set to three.
    
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd
7.2.1.2 Erasure coded #
This sample lowers the overall storage capacity requirement while also adding redundancy by using erasure coding (see Section 7.2.2.4, “Erasure coding”).
This sample requires at least three BlueStore OSDs.
     The OSDs can be located on a single Ceph node or spread across multiple
     nodes, because the failureDomain is set to
     osd and the erasureCoded chunk
     settings require at least three different OSDs (two
     dataChunks + one codingChunks).
    
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ecpool
  namespace: rook-ceph
spec:
  failureDomain: osd
  erasureCoded:
    dataChunks: 2
    codingChunks: 1
  deviceClass: hdd
High performance applications typically will not use erasure coding due to the performance overhead of creating and distributing the chunks in the cluster.
We recommend creating erasure-coded pools only when you have BlueStore OSDs in your cluster.
7.2.2 Pool settings #
7.2.2.1 Metadata #
- name: The name of the pool to create.
- namespace: The namespace of the Rook cluster where the pool is created.
7.2.2.2 Specification #
- replicated: Settings for a replicated pool. If specified, erasureCoded settings must not be specified.
  - size: The desired number of copies to make of the data in the pool.
  - requireSafeReplicaSize: Set to false if you want to create a pool with size one. Setting the pool size to one could lead to data loss without recovery.
- erasureCoded: Settings for an erasure-coded pool. If specified, replicated settings must not be specified. For more details, see Section 7.2.2.4, “Erasure coding”.
  - dataChunks: Number of chunks to divide the original object into.
  - codingChunks: Number of coding chunks to generate.
- failureDomain: The failure domain across which the data will be spread. This can be set to a value of either osd or host, with host being the default setting. A failure domain can also be set to a different type (for example, rack), if it is added as a location in the Storage Selection Settings. If a replicated pool of size three is configured and the failureDomain is set to host, all three copies of the replicated data will be placed on OSDs located on three different Ceph hosts. This case is guaranteed to tolerate a failure of two hosts without a loss of data. Similarly, a failure domain set to osd can tolerate a loss of two OSD devices.
  If erasure coding is used, the data and coding chunks are spread across the configured failure domain.
  Note: Neither Rook nor Ceph prevents the creation of a cluster where the replicated data (or erasure-coded chunks) cannot be written safely. By design, Ceph delays checking for suitable OSDs until a write request is made, and this write can hang if there are not sufficient OSDs to satisfy the request.
- deviceClass: Sets up the CRUSH rule for the pool to distribute data only on the specified device class. If left empty or unspecified, the pool will use the cluster’s default CRUSH root, which usually distributes data over all OSDs, regardless of their class.
- crushRoot: The root in the crush map to be used by the pool. If left empty or unspecified, the default root will be used. Creating a crush hierarchy for the OSDs currently requires the Rook toolbox to run the Ceph tools.
- enableRBDStats: Enables collecting RBD per-image IO statistics by enabling dynamic OSD performance counters. Defaults to false.
- parameters: Sets any parameters listed to the given pool.
  - target_size_ratio: Gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool.
  - compression_mode: Sets up the pool for inline compression when using a BlueStore OSD. If left unspecified, no compression mode is set up for the pool. Supported values are the same as the BlueStore inline compression modes: none, passive, aggressive, and force.
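The following sketch combines several of the settings above in a single replicated pool. The pool name is hypothetical, and the quoted fractional form of the target_size_ratio value is an assumption about the expected format; verify it against your Rook version before use.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: compressedpool        # hypothetical name
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  # restrict the pool's CRUSH rule to OSDs of this device class
  deviceClass: hdd
  enableRBDStats: true
  parameters:
    # one of: none, passive, aggressive, force
    compression_mode: aggressive
    # assumption: hint given as a quoted fraction of total cluster capacity
    target_size_ratio: ".5"
```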
 
7.2.2.3 Add specific pool properties #
With the parameters setting, you can set any pool property:
    
spec:
  parameters:
    <name of the parameter>: <parameter value>
For example:
spec:
  parameters:
    min_size: 1
7.2.2.4 Erasure coding #
Erasure coding allows you to keep your data safe while reducing the storage overhead. Instead of creating multiple replicas of the data, erasure coding divides the original data into chunks of equal size, then generates extra chunks of that same size for redundancy.
For example, if you have an object of size 2 MB, the simplest erasure coding with two data chunks would divide the object into two chunks of size 1 MB each (data chunks). One more chunk (coding chunk) of size 1 MB will be generated. In total, 3 MB will be stored in the cluster. The object will be able to suffer the loss of any one of the chunks and still be able to reconstruct the original object.
The number of data and coding chunks you choose will depend on your resiliency to loss and how much storage overhead is acceptable in your storage cluster. Here are some examples to illustrate how the number of chunks affects the storage and loss tolerance.
| Data chunks (k) | Coding chunks (m) | Total storage | Losses Tolerated | OSDs required | 
|---|---|---|---|---|
| 2 | 1 | 1.5x | 1 | 3 | 
| 2 | 2 | 2x | 2 | 4 | 
| 4 | 2 | 1.5x | 2 | 6 | 
| 16 | 4 | 1.25x | 4 | 20 | 
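The figures in the table follow from simple arithmetic. As a worked sketch, with k data chunks and m coding chunks (the symbols from the table header), and assuming the failure domain is osd so that each chunk needs its own OSD:

```latex
\[
\text{total storage} = \frac{k+m}{k}, \qquad
\text{losses tolerated} = m, \qquad
\text{OSDs required} = k+m
\]
\[
k=4,\ m=2:\quad \frac{4+2}{4} = 1.5\times, \quad 2 \text{ losses}, \quad 6 \text{ OSDs}
\]
\[
k=16,\ m=4:\quad \frac{16+4}{16} = 1.25\times, \quad 4 \text{ losses}, \quad 20 \text{ OSDs}
\]
```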
The failureDomain must also be taken into account when determining the number of chunks. The failure domain determines the level in the Ceph CRUSH hierarchy where the chunks must be uniquely distributed. This decision will impact whether node losses or disk losses are tolerated. There could also be performance differences when placing the data across nodes or OSDs.
    
- host: All chunks will be placed on unique hosts
- osd: All chunks will be placed on unique OSDs
If you do not have a sufficient number of hosts or OSDs for unique placement, the pool can still be created, but writing to the pool will hang.
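As an illustration of how the chunk counts and the failure domain interact, the sketch below defines a pool with four data chunks and two coding chunks placed on unique hosts. The pool name is hypothetical. With these settings, at least six hosts with OSDs are needed, and the pool tolerates the loss of two hosts at 1.5x storage overhead.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ecpool-host           # hypothetical name
  namespace: rook-ceph
spec:
  # every chunk is placed on a different host
  failureDomain: host
  erasureCoded:
    dataChunks: 4
    codingChunks: 2
```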
Rook currently only configures two levels in the CRUSH map. It is also possible to configure other levels, such as rack, by adding topology labels to the nodes.