Applies to SUSE Enterprise Storage 7.1

17 Stored data management

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH Map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

CRUSH maps contain a list of OSDs, a list of 'buckets' for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

After you deploy a Ceph cluster, a default CRUSH Map is generated. It is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH Map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.

For example, if an OSD goes down, a CRUSH Map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.

Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or with power to the rack rather than with the OSDs themselves.

A custom CRUSH Map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) (refer to Section 17.4, “Placement groups”) associated with a failed host are in a degraded state.

There are three main sections to a CRUSH Map.

  • OSD devices consist of any object storage device corresponding to a ceph-osd daemon.

  • Buckets consist of a hierarchical aggregation of storage locations (for example rows, racks, hosts, etc.) and their assigned weights.

  • Rule sets define the manner of selecting buckets.

17.1 OSD devices

To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (the name of the OSD daemon). The list of devices appears first in the CRUSH Map.

#devices
device NUM osd.OSD_NAME class CLASS_NAME

For example:

#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd

As a general rule, an OSD daemon maps to a single disk.

17.1.1 Device classes

The flexibility of the CRUSH Map in controlling data placement is one of Ceph's strengths. It is also one of the most difficult parts of the cluster to manage. Device classes automate the most common changes to CRUSH Maps that administrators previously needed to perform manually.

17.1.1.1 The CRUSH management problem

Ceph clusters are frequently built with multiple types of storage devices: HDD, SSD, NVMe, or even mixed classes of the above. We call these different types of storage devices device classes to avoid confusion with the type property of CRUSH buckets (for example, host, rack, or row; see Section 17.2, “Buckets” for more details). Ceph OSDs backed by SSDs are much faster than those backed by spinning disks, making them better suited for certain workloads. Ceph makes it easy to create RADOS pools for different data sets or workloads and to assign different CRUSH rules to control data placement for those pools.

Figure 17.1: OSDs with mixed device classes

However, setting up CRUSH rules to place data only on a certain class of device is tedious. Rules work in terms of the CRUSH hierarchy, but if the devices are mixed into the same hosts or racks (as in the sample hierarchy above), they will (by default) be mixed together and appear in the same sub-trees of the hierarchy. In previous versions of SUSE Enterprise Storage, manually separating them out into separate trees involved creating multiple versions of each intermediate node for each device class.

17.1.1.2 Device classes

An elegant solution that Ceph offers is to add a property called device class to each OSD. By default, OSDs automatically set their device classes to 'hdd', 'ssd', or 'nvme' based on the hardware properties exposed by the Linux kernel. These device classes are reported in the CLASS column of the ceph osd tree command output:

cephuser@adm > ceph osd tree
 ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       83.17899 root default
 -4       23.86200     host cpach
 2   hdd  1.81898         osd.2      up  1.00000 1.00000
 3   hdd  1.81898         osd.3      up  1.00000 1.00000
 4   hdd  1.81898         osd.4      up  1.00000 1.00000
 5   hdd  1.81898         osd.5      up  1.00000 1.00000
 6   hdd  1.81898         osd.6      up  1.00000 1.00000
 7   hdd  1.81898         osd.7      up  1.00000 1.00000
 8   hdd  1.81898         osd.8      up  1.00000 1.00000
 15  hdd  1.81898         osd.15     up  1.00000 1.00000
 10  nvme 0.93100         osd.10     up  1.00000 1.00000
 0   ssd  0.93100         osd.0      up  1.00000 1.00000
 9   ssd  0.93100         osd.9      up  1.00000 1.00000

If the automatic device class detection fails, for example because the device driver is not properly exposing information about the device via /sys/block, you can adjust device classes from the command line:

cephuser@adm > ceph osd crush rm-device-class osd.2 osd.3
done removing class of osd(s): 2,3
cephuser@adm > ceph osd crush set-device-class ssd osd.2 osd.3
set osd(s) 2,3 to class 'ssd'

17.1.1.3 Setting CRUSH placement rules

CRUSH rules can restrict placement to a specific device class. For example, you can create a 'fast' replicated pool that distributes data only over SSD disks by running the following command:

cephuser@adm > ceph osd crush rule create-replicated RULE_NAME ROOT FAILURE_DOMAIN_TYPE DEVICE_CLASS

For example:

cephuser@adm > ceph osd crush rule create-replicated fast default host ssd

Create a pool named 'fast_pool' and assign it to the 'fast' rule:

cephuser@adm > ceph osd pool create fast_pool 128 128 replicated fast
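
To confirm the assignment, you can query the pool's crush_rule property, which reports the name of the rule the pool uses:

cephuser@adm > ceph osd pool get fast_pool crush_rule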

The process for creating erasure code rules is slightly different. First, you create an erasure code profile that includes a property for your desired device class. Then, use that profile when creating the erasure coded pool:

cephuser@adm > ceph osd erasure-code-profile set myprofile \
 k=4 m=2 crush-device-class=ssd crush-failure-domain=host
cephuser@adm > ceph osd pool create mypool 64 erasure myprofile

In case you need to manually edit the CRUSH Map to customize your rule, the syntax has been extended to allow the device class to be specified. For example, the CRUSH rule generated by the above commands looks as follows:

rule ecpool {
  id 2
  type erasure
  min_size 3
  max_size 6
  step set_chooseleaf_tries 5
  step set_choose_tries 100
  step take default class ssd
  step chooseleaf indep 0 type host
  step emit
}

The important difference here is that the 'take' command includes the additional 'class CLASS_NAME' suffix.

17.1.1.4 Additional commands

To list device classes used in a CRUSH Map, run:

cephuser@adm > ceph osd crush class ls
[
  "hdd",
  "ssd"
]

To list existing CRUSH rules, run:

cephuser@adm > ceph osd crush rule ls
replicated_rule
fast

To view details of the CRUSH rule named 'fast', run:

cephuser@adm > ceph osd crush rule dump fast
{
  "rule_id": 1,
  "rule_name": "fast",
  "ruleset": 1,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
    {
      "op": "take",
      "item": -21,
      "item_name": "default~ssd"
    },
    {
      "op": "chooseleaf_firstn",
      "num": 0,
      "type": "host"
    },
    {
      "op": "emit"
    }
  ]
}

To list OSDs that belong to an 'ssd' class, run:

cephuser@adm > ceph osd crush class ls-osd ssd
0
1

17.1.1.5 Migrating from a legacy SSD rule to device classes

In SUSE Enterprise Storage prior to version 5, you needed to manually edit the CRUSH Map and maintain a parallel hierarchy for each specialized device type (such as SSD) in order to write rules that apply to these devices. Since SUSE Enterprise Storage 5, the device class feature has enabled this transparently.

You can transform a legacy rule and hierarchy to the new class-based rules by using the crushtool command. There are several types of transformation possible:

crushtool --reclassify-root ROOT_NAME DEVICE_CLASS

This command takes everything in the hierarchy beneath ROOT_NAME and adjusts any rules that reference that root via

take ROOT_NAME

to instead

take ROOT_NAME class DEVICE_CLASS

It renumbers the buckets so that the old IDs are used for the specified class's 'shadow tree'. As a consequence, no data movement occurs.

Example 17.1: crushtool --reclassify-root

Consider the following existing rule:

rule replicated_ruleset {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type rack
   step emit
}

If you reclassify the root 'default' as class 'hdd', the rule will become

rule replicated_ruleset {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default class hdd
   step chooseleaf firstn 0 type rack
   step emit
}

crushtool --set-subtree-class BUCKET_NAME DEVICE_CLASS

This method marks every device in the subtree rooted at BUCKET_NAME with the specified device class.

--set-subtree-class is normally used in conjunction with the --reclassify-root option to ensure that all devices in that root are labeled with the correct class. However, some of those devices may intentionally have a different class, and in that case you do not want to relabel them. In such cases, exclude the --set-subtree-class option. Keep in mind that such remapping will not be perfect, because the previous rule is distributed across devices of multiple classes, while the adjusted rules will map only to devices of the specified device class.

crushtool --reclassify-bucket MATCH_PATTERN DEVICE_CLASS DEFAULT_PATTERN

This method allows merging a parallel type-specific hierarchy with the normal hierarchy. For example, many users have CRUSH Maps similar to the following one:

Example 17.2: crushtool --reclassify-bucket

host node1 {
   id -2           # do not change unnecessarily
   # weight 109.152
   alg straw
   hash 0  # rjenkins1
   item osd.0 weight 9.096
   item osd.1 weight 9.096
   item osd.2 weight 9.096
   item osd.3 weight 9.096
   item osd.4 weight 9.096
   item osd.5 weight 9.096
   [...]
}

host node1-ssd {
   id -10          # do not change unnecessarily
   # weight 2.000
   alg straw
   hash 0  # rjenkins1
   item osd.80 weight 2.000
   [...]
}

root default {
   id -1           # do not change unnecessarily
   alg straw
   hash 0  # rjenkins1
   item node1 weight 110.967
   [...]
}

root ssd {
   id -18          # do not change unnecessarily
   # weight 16.000
   alg straw
   hash 0  # rjenkins1
   item node1-ssd weight 2.000
   [...]
}

This function reclassifies each bucket that matches a given pattern. The pattern can look like %suffix or prefix%. In the above example, you would use the pattern %-ssd. For each matched bucket, the remaining portion of the name that matches the '%' wild card specifies the base bucket. All devices in the matched bucket are labeled with the specified device class and then moved to the base bucket. If the base bucket does not exist (for example, if 'node12-ssd' exists but 'node12' does not), then it is created and linked underneath the specified default parent bucket. The old bucket IDs are preserved for the new shadow buckets to prevent data movement. Rules with take steps that reference the old buckets are adjusted accordingly.

crushtool --reclassify-bucket BUCKET_NAME DEVICE_CLASS BASE_BUCKET

You can use the --reclassify-bucket option without a wild card to map a single bucket. For example, in the preceding map, the 'ssd' bucket needs to be mapped to the default bucket.

The final command to convert the map comprised of the above fragments would be as follows:

cephuser@adm > ceph osd getcrushmap -o original
cephuser@adm > crushtool -i original --reclassify \
  --set-subtree-class default hdd \
  --reclassify-root default hdd \
  --reclassify-bucket %-ssd ssd default \
  --reclassify-bucket ssd ssd default \
  -o adjusted

To verify that the conversion is correct, there is a --compare option that tests a large sample of inputs to the CRUSH Map and checks whether the same result comes back out. These inputs are controlled by the same options that apply to crushtool --test. For the above example, the command would be as follows:

cephuser@adm > crushtool -i original --compare adjusted
rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent
Tip

If there were differences, you would see what ratio of inputs are remapped in the parentheses.

If you are satisfied with the adjusted CRUSH Map, you can apply it to the cluster:

cephuser@adm > ceph osd setcrushmap -i adjusted

17.1.1.6 For more information

Find more details on CRUSH Maps in Section 17.5, “CRUSH Map manipulation”.

Find more details on Ceph pools in general in Chapter 18, Manage storage pools.

Find more details about erasure coded pools in Chapter 19, Erasure coded pools.

17.2 Buckets

CRUSH maps contain a list of OSDs, which can be organized into a tree-structured arrangement of buckets for aggregating the devices into physical locations. Individual OSDs comprise the leaves on the tree.

  • 0 osd: A specific device or OSD (osd.1, osd.2, etc.).

  • 1 host: The name of a host containing one or more OSDs.

  • 2 chassis: Identifier for which chassis in the rack contains the host.

  • 3 rack: A computer rack. The default is unknownrack.

  • 4 row: A row in a series of racks.

  • 5 pdu: Abbreviation for "Power Distribution Unit".

  • 6 pod: Abbreviation for "Point of Delivery": in this context, a group of PDUs, or a group of rows of racks.

  • 7 room: A room containing rows of racks.

  • 8 datacenter: A physical data center containing one or more rooms.

  • 9 region: Geographical region of the world (for example, NAM, LAM, EMEA, APAC etc.)

  • 10 root: The root node of the tree of OSD buckets (normally set to default).

Tip

You can modify the existing types and create your own bucket types.

Ceph's deployment tools generate a CRUSH Map that contains a bucket for each host, and a root named 'default', which is useful for the default rbd pool. The remaining bucket types provide a means for storing information about the physical location of nodes/buckets, which makes cluster administration much easier when OSDs, hosts, or network hardware malfunction and the administrator needs access to physical hardware.

A bucket has a type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity/capability of its item(s), the bucket algorithm (straw2 by default), and the hash (0 by default, reflecting the CRUSH hash rjenkins1). A bucket may have one or more items. The items may consist of other buckets or OSDs. Items may have a weight that reflects the relative weight of the item.

[bucket-type] [bucket-name] {
  id [a unique negative numeric ID]
  weight [the relative capacity/capability of the item(s)]
  alg [the bucket type: uniform | list | tree | straw2 | straw ]
  hash [the hash type: 0 by default]
  item [item-name] weight [weight]
}

The following example illustrates how you can use buckets to aggregate a pool and physical locations like a data center, a room, a rack and a row.

host ceph-osd-server-1 {
        id -17
        alg straw2
        hash 0
        item osd.0 weight 0.546
        item osd.1 weight 0.546
}

row rack-1-row-1 {
        id -16
        alg straw2
        hash 0
        item ceph-osd-server-1 weight 2.00
}

rack rack-3 {
        id -15
        alg straw2
        hash 0
        item rack-3-row-1 weight 2.00
        item rack-3-row-2 weight 2.00
        item rack-3-row-3 weight 2.00
        item rack-3-row-4 weight 2.00
        item rack-3-row-5 weight 2.00
}

rack rack-2 {
        id -14
        alg straw2
        hash 0
        item rack-2-row-1 weight 2.00
        item rack-2-row-2 weight 2.00
        item rack-2-row-3 weight 2.00
        item rack-2-row-4 weight 2.00
        item rack-2-row-5 weight 2.00
}

rack rack-1 {
        id -13
        alg straw2
        hash 0
        item rack-1-row-1 weight 2.00
        item rack-1-row-2 weight 2.00
        item rack-1-row-3 weight 2.00
        item rack-1-row-4 weight 2.00
        item rack-1-row-5 weight 2.00
}

room server-room-1 {
        id -12
        alg straw2
        hash 0
        item rack-1 weight 10.00
        item rack-2 weight 10.00
        item rack-3 weight 10.00
}

datacenter dc-1 {
        id -11
        alg straw2
        hash 0
        item server-room-1 weight 30.00
        item server-room-2 weight 30.00
}

root data {
        id -10
        alg straw2
        hash 0
        item dc-1 weight 60.00
        item dc-2 weight 60.00
}

17.3 Rule sets

CRUSH maps support the notion of 'CRUSH rules', which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH Map has a rule for the default root. If you want more roots and more rules, you need to create them later or they will be created automatically when new pools are created.

Note

In most cases, you will not need to modify the default rules. When you create a new pool, its default ruleset is 0.

A rule takes the following form:

rule rulename {
        ruleset ruleset
        type type
        min_size min-size
        max_size max-size
        step step
}
ruleset

An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.

type

A string. Describes a rule for either a 'replicated' or 'erasure' coded pool. This option is required. Default is replicated.

min_size

An integer. If a pool makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 2.

max_size

An integer. If a pool makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.

step take bucket

Takes a bucket specified by a name, and begins iterating down the tree. This option is required. For an explanation about iterating through the tree, see Section 17.3.1, “Iterating the node tree”.

step target mode num type bucket-type

target can either be choose or chooseleaf. When set to choose, a number of buckets is selected. chooseleaf directly selects the OSDs (leaf nodes) from the sub-tree of each bucket in the set of buckets.

mode can either be firstn or indep. See Section 17.3.2, “firstn and indep”.

Selects the number of buckets of the given type, where N is the number of options available: if num > 0 && < N, choose that many buckets; if num < 0, it means N minus the absolute value of num; and if num == 0, choose all N available buckets. Follows step take or step choose.

step emit

Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to form different trees in the same rule. Follows step choose.

17.3.1 Iterating the node tree

The structure defined with the buckets can be viewed as a node tree. Buckets are nodes and OSDs are leaves in this tree.

Rules in the CRUSH Map define how OSDs are selected from this tree. A rule starts at a node and then iterates down the tree to return a set of OSDs. It is not possible to define which branch needs to be selected. Instead, the CRUSH algorithm ensures that the set of OSDs fulfills the replication requirements and evenly distributes the data.

With step take bucket, the iteration through the node tree begins at the given bucket (not bucket type). If OSDs from all branches in the tree are to be returned, the bucket must be the root bucket. Otherwise, the following steps iterate only through a sub-tree.

After step take one or more step choose entries follow in the rule definition. Each step choose chooses a defined number of nodes (or branches) from the previously selected upper node.

In the end the selected OSDs are returned with step emit.

step chooseleaf is a convenience function that directly selects OSDs from branches of the given bucket.

Figure 17.2, “Example tree” provides an example of how step is used to iterate through a tree. The orange arrows and numbers correspond to example1a and example1b, while blue corresponds to example2 in the following rule definitions.

Figure 17.2: Example tree

# orange arrows
rule example1a {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # orange (1)
        step take rack1
        # orange (2)
        step choose firstn 0 host
        # orange (3)
        step choose firstn 1 osd
        step emit
}

rule example1b {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # orange (1)
        step take rack1
        # orange (2) + (3)
        step chooseleaf firstn 0 host
        step emit
}

# blue arrows
rule example2 {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # blue (1)
        step take room1
        # blue (2)
        step chooseleaf firstn 0 rack
        step emit
}

17.3.2 firstn and indep

A CRUSH rule defines replacements for failed nodes or OSDs (see Section 17.3, “Rule sets”). The keyword step requires either firstn or indep as a parameter. Figure 17.3, “Node replacement methods” provides an example.

firstn adds replacement nodes to the end of the list of active nodes. In case of a failed node, the following healthy nodes are shifted to the left to fill the gap left by the failed node. This is the default and desired method for replicated pools, because a secondary node already has all the data and therefore can take over the duties of the primary node immediately.

indep selects fixed replacement nodes for each active node. The replacement of a failed node does not change the order of the remaining nodes. This is desired for erasure coded pools. In erasure coded pools the data stored on a node depends on its position in the node selection. When the order of nodes changes, all data on affected nodes needs to be relocated.

Figure 17.3: Node replacement methods

17.4 Placement groups

Ceph maps objects to placement groups (PGs). Placement groups are shards or fragments of a logical object pool that place objects as a group into OSDs. Placement groups reduce the amount of per-object metadata when Ceph stores the data in OSDs. A larger number of placement groups—for example, 100 per OSD—leads to better balancing.

17.4.1 Using placement groups

A placement group (PG) aggregates objects within a pool. The main reason is that tracking object placement and metadata on a per-object basis is computationally expensive. For example, a system with millions of objects cannot track placement of each of its objects directly.

Figure 17.4: Placement groups in a pool

The Ceph client calculates which placement group an object belongs to. It does this by hashing the object ID and applying an operation based on the number of PGs in the defined pool and the ID of the pool.
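
You can ask the cluster to show this mapping for a specific object. As a quick check (the pool and object names below are hypothetical), the following command prints the placement group and the set of OSDs the object maps to:

cephuser@adm > ceph osd map mypool myobject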

The object's contents within a placement group are stored in a set of OSDs. For example, in a replicated pool of size two, each placement group will store objects on two OSDs:

Figure 17.5: Placement groups and OSDs

If OSD #2 fails, another OSD will be assigned to placement group #1 and will be filled with copies of all objects in OSD #1. If the pool size is changed from two to three, an additional OSD will be assigned to the placement group and will receive copies of all objects in the placement group.

Placement groups do not own an OSD; they share it with other placement groups from the same pool or even from other pools. If OSD #2 fails, placement group #2 will also need to restore copies of its objects, using OSD #3.

When the number of placement groups increases, the new placement groups will be assigned OSDs. The result of the CRUSH function will also change and some objects from the former placement groups will be copied over to the new placement groups and removed from the old ones.

17.4.2 Determining the value of PG_NUM

Note

Since Ceph Nautilus (v14.x), you can use the Ceph Manager pg_autoscaler module to auto-scale the PGs as needed. If you want to enable this feature, refer to Section 8.1.1.1, “Default PG and PGP counts”.

When creating a new pool, you can still choose the value of PG_NUM manually:

# ceph osd pool create POOL_NAME PG_NUM

PG_NUM cannot be calculated automatically. The following are a few commonly used values, depending on the number of OSDs in the cluster:

Less than 5 OSDs:

Set PG_NUM to 128.

Between 5 and 10 OSDs:

Set PG_NUM to 512.

Between 10 and 50 OSDs:

Set PG_NUM to 1024.

As the number of OSDs increases, choosing the right value for PG_NUM becomes more important. PG_NUM strongly affects the behavior of the cluster as well as the durability of the data in case of OSD failure.

17.4.2.1 Calculating placement groups for more than 50 OSDs

If you have fewer than 50 OSDs, use the preselection described in Section 17.4.2, “Determining the value of PG_NUM”. If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability, and distribution. For a single pool of objects, you can use the following formula to get a baseline:

total PGs = (OSDs * 100) / POOL_SIZE

Where POOL_SIZE is either the number of replicas for replicated pools, or the 'k'+'m' sum for erasure coded pools as returned by the ceph osd erasure-code-profile get command. You should round the result up to the nearest power of 2. Rounding up is recommended for the CRUSH algorithm to evenly balance the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate the number of PGs as follows:

          (200 * 100) / 3 = 6667

The nearest power of 2 is 8192.
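
If you prefer not to compute the power of 2 by hand, the following minimal sketch reproduces the calculation above on the admin node, assuming Python 3 is available there:

cephuser@adm > python3 -c 'import math; pgs = (200 * 100) / 3; print(2 ** math.ceil(math.log2(pgs)))'
8192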

When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD. You need to reach a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.

For example, a cluster of 10 pools, each with 512 placement groups on 10 OSDs, is a total of 5,120 placement groups spread over 10 OSDs, that is 512 placement groups per OSD. Such a setup does not use too many resources. However, if 1000 pools were created with 512 placement groups each, the OSDs would handle approximately 50,000 placement groups each and it would require significantly more resources and time for peering.

17.4.3 Setting the number of placement groups

Note

Since Ceph Nautilus (v14.x), you can use the Ceph Manager pg_autoscaler module to auto-scale the PGs as needed. If you want to enable this feature, refer to Section 8.1.1.1, “Default PG and PGP counts”.

If you still need to specify the number of placement groups in a pool manually, you need to specify them at the time of pool creation (see Section 18.1, “Creating a pool”). Once you have set placement groups for a pool, you may increase the number of placement groups by running the following command:

# ceph osd pool set POOL_NAME pg_num PG_NUM

After you increase the number of placement groups, you also need to increase the number of placement groups for placement (PGP_NUM) before your cluster will rebalance. PGP_NUM will be the number of placement groups that will be considered for placement by the CRUSH algorithm. Increasing PG_NUM splits the placement groups but data will not be migrated to the newer placement groups until PGP_NUM is increased. PGP_NUM should be equal to PG_NUM. To increase the number of placement groups for placement, run the following:

# ceph osd pool set POOL_NAME pgp_num PGP_NUM
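
For example, to grow a hypothetical pool named 'data_pool' to 256 placement groups and make the change effective for placement, run both commands:

# ceph osd pool set data_pool pg_num 256
# ceph osd pool set data_pool pgp_num 256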

17.4.4 Finding the number of placement groups

To find out the number of placement groups in a pool, run the following get command:

# ceph osd pool get POOL_NAME pg_num

17.4.5 Finding a cluster's PG statistics

To find out the statistics for the placement groups in your cluster, run the following command:

# ceph pg dump [--format FORMAT]

Valid formats are 'plain' (default) and 'json'.

17.4.6 Finding statistics for stuck PGs

To find out the statistics for all placement groups stuck in a specified state, run the following:

# ceph pg dump_stuck STATE \
 [--format FORMAT] [--threshold THRESHOLD]

STATE is one of 'inactive' (PGs cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come up), 'unclean' (PGs contain objects that are not replicated the desired number of times), 'stale' (PGs are in an unknown state—the OSDs that host them have not reported to the monitor cluster in a time interval specified by the mon_osd_report_timeout option), 'undersized', or 'degraded'.

Valid formats are 'plain' (default) and 'json'.

The threshold defines the minimum number of seconds the placement group is stuck before including it in the returned statistics (300 seconds by default).

17.4.7 Searching a placement group map

To search for the placement group map for a particular placement group, run the following:

# ceph pg map PG_ID

Ceph will return the placement group map, the placement group, and the OSD status:

# ceph pg map 1.6c
osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]

17.4.8 Retrieving a placement group's statistics

To retrieve statistics for a particular placement group, run the following:

# ceph pg PG_ID query

17.4.9 Scrubbing a placement group

To scrub (Section 17.6, “Scrubbing placement groups”) a placement group, run the following:

# ceph pg scrub PG_ID

Ceph checks the primary and replica nodes, generates a catalog of all objects in the placement group, and compares them to ensure that no objects are missing or mismatched and their contents are consistent. Assuming the replicas all match, a final semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.

17.4.10 Prioritizing backfill and recovery of placement groups

You may run into a situation where several placement groups require recovery and/or backfill, while some groups hold data more important than others. For example, those PGs may hold data for images used by running machines, while other PGs may be used by inactive machines or hold less relevant data. In that case, you may want to prioritize recovery of those groups so that performance and availability of the data stored on them is restored earlier. To mark particular placement groups as prioritized during backfill or recovery, run the following:

# ceph pg force-recovery PG_ID1 [PG_ID2 ... ]
# ceph pg force-backfill PG_ID1 [PG_ID2 ... ]

This will cause Ceph to perform recovery or backfill on the specified placement groups first, before other placement groups. This does not interrupt currently ongoing backfills or recovery, but causes the specified PGs to be processed as soon as possible. If you change your mind or prioritize the wrong groups, cancel the prioritization:

# ceph pg cancel-force-recovery PG_ID1 [PG_ID2 ... ]
# ceph pg cancel-force-backfill PG_ID1 [PG_ID2 ... ]

The cancel-* commands remove the 'force' flag from the PGs so that they are processed in default order. Again, this does not affect placement groups currently being processed, only those that are still queued. The 'force' flag is cleared automatically after recovery or backfill of the group is done.

17.4.11 Reverting lost objects

If the cluster has lost one or more objects and you have decided to abandon the search for the lost data, you need to mark the unfound objects as 'lost'.

If the objects are still lost after having queried all possible locations, you may need to give up on the lost objects. This is possible given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves are recovered.

Currently the only supported option is 'revert', which will either roll back to a previous version of the object, or forget about it entirely in case of a new object. To mark the 'unfound' objects as 'lost', run the following:

cephuser@adm > ceph pg PG_ID mark_unfound_lost revert|delete

17.4.12 Enabling the PG auto-scaler

Placement groups (PGs) are an internal implementation detail of how Ceph distributes data. By enabling pg-autoscaling, you can allow the cluster to either make recommendations or automatically tune PGs based on how the cluster is used.

The autoscaler is configured on a per-pool basis: each pool has a pg_autoscale_mode property that can be set to off, on, or warn:

off

Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool.

on

Enable automated adjustments of the PG count for the given pool.

warn

Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for existing pools:

cephuser@adm > ceph osd pool set POOL_NAME pg_autoscale_mode MODE
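
For example, to enable automated PG tuning on a hypothetical pool named 'data_pool':

cephuser@adm > ceph osd pool set data_pool pg_autoscale_mode on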

You can also configure the default pg_autoscale_mode that is applied to any pools that are created in the future with:

cephuser@adm > ceph config set global osd_pool_default_pg_autoscale_mode MODE

You can view each pool, its relative utilization, and any suggested changes to the PG count with this command:

cephuser@adm > ceph osd pool autoscale-status

17.5 CRUSH Map manipulation

This section introduces basic CRUSH Map manipulation, such as editing a CRUSH Map, changing CRUSH Map parameters, and adding, moving, or removing an OSD.

17.5.1 Editing a CRUSH Map

To edit an existing CRUSH map, do the following:

  1. Get a CRUSH Map. To get the CRUSH Map for your cluster, execute the following:

    cephuser@adm > ceph osd getcrushmap -o compiled-crushmap-filename

    Ceph will output (-o) a compiled CRUSH Map to the file name you specified. Since the CRUSH Map is in a compiled form, you must decompile it first before you can edit it.

  2. Decompile a CRUSH Map. To decompile a CRUSH Map, execute the following:

    cephuser@adm > crushtool -d compiled-crushmap-filename \
     -o decompiled-crushmap-filename

    Ceph will decompile (-d) the compiled CRUSH Map and output (-o) it to the file name you specified.

  3. Edit at least one of the Devices, Buckets, or Rules sections.

  4. Compile a CRUSH Map. To compile a CRUSH Map, execute the following:

    cephuser@adm > crushtool -c decompiled-crushmap-filename \
     -o compiled-crushmap-filename

    Ceph will store a compiled CRUSH Map to the file name you specified.

  5. Set a CRUSH Map. To set the CRUSH Map for your cluster, execute the following:

    cephuser@adm > ceph osd setcrushmap -i compiled-crushmap-filename

    Ceph will read the compiled CRUSH Map from the file name you specified and apply it as the CRUSH Map for the cluster.

Tip: Use versioning system

Use a versioning system, such as git or svn, for the exported and modified CRUSH Map files. This makes rolling back changes simple.

Tip: Test the new CRUSH Map

Test the new adjusted CRUSH Map using the crushtool --test command, and compare it to the state before applying the new CRUSH Map. You may find the following command switches useful: --show-statistics, --show-mappings, --show-bad-mappings, --show-utilization, --show-utilization-all, and --show-choose-tries.
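
For example, the following sketch tests how rule 0 of the adjusted map would place three replicas and reports statistics and any bad mappings; the file name refers to the compiled map from the procedure above:

cephuser@adm > crushtool -i compiled-crushmap-filename --test \
 --rule 0 --num-rep 3 --show-statistics --show-bad-mappings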

17.5.2 Adding or moving an OSD

To add or move an OSD in the CRUSH Map of a running cluster, execute the following:

cephuser@adm > ceph osd crush set id_or_name weight root=pool-name
bucket-type=bucket-name ...
id

An integer. The numeric ID of the OSD. This option is required.

name

A string. The full name of the OSD. This option is required.

weight

A double. The CRUSH weight for the OSD. This option is required.

root

A key/value pair. By default, the CRUSH hierarchy contains the pool default as its root. This option is required.

bucket-type

Key/value pairs. You may specify the OSD's location in the CRUSH hierarchy.

The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.

cephuser@adm > ceph osd crush set osd.0 1.0 root=data datacenter=dc1 room=room1 \
row=foo rack=bar host=foo-bar-1

17.5.3 Difference between ceph osd reweight and ceph osd crush reweight

There are two similar commands that change the 'weight' of a Ceph OSD. The context of their usage is different and may cause confusion.

17.5.3.1 ceph osd reweight

Usage:

cephuser@adm > ceph osd reweight OSD_NAME NEW_WEIGHT

ceph osd reweight sets an override weight on the Ceph OSD. This value is in the range of 0 to 1, and forces CRUSH to reposition the data that would otherwise live on this drive. It does not change the weights assigned to the buckets above the OSD, and is a corrective measure in case the normal CRUSH distribution is not working out quite right. For example, if one of your OSDs is at 90% and the others are at 40%, you could reduce this weight to try and compensate for it.
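
For example, to temporarily reduce the override weight of a hypothetical overfull OSD, osd.13, so that CRUSH moves some of its data elsewhere:

cephuser@adm > ceph osd reweight osd.13 0.85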

Note: OSD weight is temporary

Note that ceph osd reweight is not a persistent setting. When an OSD gets marked out, its weight will be set to 0 and when it gets marked in again, the weight will be changed to 1.

17.5.3.2 ceph osd crush reweight

Usage:

cephuser@adm > ceph osd crush reweight OSD_NAME NEW_WEIGHT

ceph osd crush reweight sets the CRUSH weight of the OSD. This weight is an arbitrary value—generally the size of the disk in TB—and controls how much data the system tries to allocate to the OSD.
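
For example, to set the CRUSH weight of the hypothetical osd.13 to match a 1.8 TB disk:

cephuser@adm > ceph osd crush reweight osd.13 1.8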

17.5.4 Removing an OSD

To remove an OSD from the CRUSH Map of a running cluster, execute the following:

cephuser@adm > ceph osd crush remove OSD_NAME

17.5.5 Adding a bucket

To add a bucket to the CRUSH Map of a running cluster, execute the ceph osd crush add-bucket command:

cephuser@adm > ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE
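
For example, to add a hypothetical rack bucket named 'rack3', which you can then place in the hierarchy with the move command described in Section 17.5.6:

cephuser@adm > ceph osd crush add-bucket rack3 rack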

17.5.6 Moving a bucket

To move a bucket to a different location or position in the CRUSH Map hierarchy, execute the following:

cephuser@adm > ceph osd crush move BUCKET_NAME BUCKET_TYPE=BUCKET_NAME [...]

For example:

cephuser@adm > ceph osd crush move bucket1 datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

17.5.7 Removing a bucket

To remove a bucket from the CRUSH Map hierarchy, execute the following:

cephuser@adm > ceph osd crush remove BUCKET_NAME

Note: Empty bucket only

A bucket must be empty before removing it from the CRUSH hierarchy.

17.6 Scrubbing placement groups

In addition to making multiple copies of objects, Ceph ensures data integrity by scrubbing placement groups (find more information about placement groups in Section 1.3.2, “Placement groups”). Ceph scrubbing is analogous to running fsck on the object storage layer. For each placement group, Ceph generates a catalog of all objects and compares each primary object and its replicas to ensure that no objects are missing or mismatched. Daily light scrubbing checks the object size and attributes, while weekly deep scrubbing reads the data and uses checksums to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:

osd max scrubs

The maximum number of simultaneous scrub operations for a Ceph OSD. Default is 1.

osd scrub begin hour, osd scrub end hour

The hours of the day (0 to 24) that define a time window during which scrubbing can happen. By default, it begins at 0 and ends at 24.

Important

If the placement group's scrub interval exceeds the osd scrub max interval setting, the scrub will happen no matter what time window you define for scrubbing.

osd scrub during recovery

Allows scrubs during recovery. Setting this to 'false' will disable scheduling new scrubs while there is an active recovery. Already running scrubs will continue. This option is useful for reducing load on busy clusters. Default is 'true'.

osd scrub thread timeout

The maximum time in seconds before a scrub thread times out. Default is 60.

osd scrub finalize thread timeout

The maximum time in seconds before a scrub finalize thread times out. Default is 60*10.

osd scrub load threshold

The normalized maximum load. Ceph will not scrub when the system load (as defined by the ratio of getloadavg() / number of online cpus) is higher than this number. Default is 0.5.

osd scrub min interval

The minimal interval in seconds for scrubbing Ceph OSD when the Ceph cluster load is low. Default is 60*60*24 (once a day).

osd scrub max interval

The maximum interval in seconds for scrubbing Ceph OSD, irrespective of cluster load. Default is 7*60*60*24 (once a week).

osd scrub chunk min

The minimum number of object store chunks to scrub during a single operation. Ceph blocks writes to a single chunk during a scrub. Default is 5.

osd scrub chunk max

The maximum number of object store chunks to scrub during a single operation. Default is 25.

osd scrub sleep

Time to sleep before scrubbing the next group of chunks. Increasing this value slows down the whole scrub operation, while client operations are less impacted. Default is 0.

osd deep scrub interval

The interval for 'deep' scrubbing (fully reading all data). The osd scrub load threshold option does not affect this setting. Default is 60*60*24*7 (once a week).

osd scrub interval randomize ratio

Add a random delay to the osd scrub min interval value when scheduling the next scrub job for a placement group. The delay is a random value smaller than the result of osd scrub min interval * osd scrub interval randomize ratio. Therefore, the default setting practically spreads the scrubs randomly over the allowed time window of [1, 1.5] * osd scrub min interval. Default is 0.5.

osd deep scrub stride

Read size when doing a deep scrub. Default is 524288 (512 kB).
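
For example, to restrict regular scrubbing to the early morning hours on all OSDs, you could set the time window through the centralized configuration (a sketch; the chosen hours are an assumption):

cephuser@adm > ceph config set osd osd_scrub_begin_hour 1
cephuser@adm > ceph config set osd osd_scrub_end_hour 6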