Applies to SUSE Enterprise Storage 6

6 Ceph Tuning

Ceph includes a telemetry module that provides anonymized information back to the Ceph developer community. The information contained in the telemetry report helps the developers prioritize efforts and identify areas where more work may be needed. The telemetry module may need to be enabled before reporting can be turned on. To enable the module:

ceph mgr module enable telemetry

To turn on telemetry reporting use the following command:

ceph telemetry on
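
Before turning reporting on, it is possible to preview the data that would be shared (assuming the module has already been enabled):

ceph telemetry show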

Additional information about the Ceph telemetry module may be found in the Administration Guide.

6.1 Obtaining Ceph Metrics

Before adjusting Ceph tunables, it is helpful to understand the critical metrics to monitor and what they indicate. Many of these metrics are obtained by dumping raw data from the daemons with the perf dump command over the admin socket. The following example shows the command being used for osd.104.

ceph --admin-daemon /var/run/ceph/ceph-osd.104.asok perf dump

Starting with the Ceph Nautilus release, the following command may be used as well:

ceph daemon osd.104 perf dump

The output of the command is quite lengthy and may benefit from being redirected to a file.
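
For example, the output can be captured to a file and individual sections extracted with a JSON tool such as jq (a sketch, assuming jq is installed on the node):

  ceph daemon osd.104 perf dump > /tmp/osd.104-perf.json
  jq '.osd' /tmp/osd.104-perf.json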

6.2 Making Tuning Persistent

Making parameter adjustments persistent requires modifying the /etc/ceph/ceph.conf file. This is best done by modifying the source component files that DeepSea uses to manage the cluster. Each section is identified with a header such as:

  [global]
  [osd]
  [mds]
  [mon]
  [mgr]
  [client]

A section of the configuration is tuned by modifying the corresponding [sectionname].conf file in the /srv/salt/ceph/configuration/files/ceph.conf.d/ directory. After modifying the configuration file, run the following commands, replacing master with the Salt master minion name (usually the admin node). The changes are then pushed to all cluster nodes.

  salt 'master' state.apply ceph.configuration.create
  salt '*' state.apply ceph.configuration

Changes made in this way require the affected services to be restarted before taking effect. It is also possible to deploy these files before running stage 2 of the SUSE Enterprise Storage deployment process, which is especially desirable when changing settings that require node or device re-deployment.
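
For example, to persist an OSD setting discussed later in this chapter, a snippet such as the following could be placed in the osd.conf file in that directory (a sketch; the parameter and value shown are examples only):

  # /srv/salt/ceph/configuration/files/ceph.conf.d/osd.conf
  # settings in this file are applied to the [osd] section of ceph.conf
  osd_op_num_threads_per_shard = 4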

6.3 Core

6.3.1 Logging

It is possible to disable all logging to reduce latency in the various codepaths.

Warning

This tuning should be used with caution and with the understanding that logging will need to be re-enabled if support is required. This implies that an incident would need to be reproduced after logging is re-enabled.

  debug ms=0
  debug mds=0
  debug osd=0
  debug optracker=0
  debug auth=0
  debug asok=0
  debug bluestore=0
  debug bluefs=0
  debug bdev=0
  debug kstore=0
  debug rocksdb=0
  debug eventtrace=0
  debug default=0
  debug rados=0
  debug client=0
  debug perfcounter=0
  debug finisher=0
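
These settings can also be applied at runtime, without a restart, using the injectargs mechanism. A minimal sketch for the OSD daemons only:

ceph tell osd.* injectargs '--debug_ms 0/0 --debug_osd 0/0'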

6.3.2 Authentication Parameters

Under certain conditions where the cluster is physically secure and isolated inside a secured network with no external exposure, it is possible to disable cephx. There are two levels at which cephx can be disabled. The first is to disable signing of authentication traffic. This can be accomplished with the following settings:

cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false

The second level of tuning completely disables cephx authentication. This should only be done on networks that are isolated from public network infrastructure. This change is achieved by adding the following three lines in the global section:

auth cluster required = none
auth service required = none
auth client required = none

6.3.3 RADOS Operations

The backend processes for performing RADOS operations show up under the throttle-objecter_ops and throttle-objecter_bytes counters when dumping the various daemons. If too much time is being spent in wait, there may be some performance to gain by increasing the memory allowed for in-flight operations or by increasing the total number of in-flight operations overall.

objecter inflight op bytes = 1073741824 # default 100 MB
objecter inflight ops = 24576
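
Whether operations are waiting on these throttles can be checked in the perf dump output of daemons that act as RADOS clients, such as the Metadata Server or the Object Gateway. A sketch, assuming jq is installed and mds1 is a placeholder for the daemon name:

  ceph daemon mds.mds1 perf dump | jq '."throttle-objecter_ops".wait'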

6.3.4 OSD Parameters

Increasing the number of op threads may be helpful with SSD and NVMe devices, as it provides more work queues for operations.

osd_op_num_threads_per_shard = 4

6.3.5 RocksDB or WAL Device

When checking the performance of BlueStore, it is important to understand whether your metadata is spilling over from the high-speed device, if one is defined, to the bulk data storage device. The relevant counters are found under bluefs in the perf dump output. If slow_used_bytes is greater than zero, the OSD is storing metadata on the bulk storage device instead of the RocksDB/WAL device. This is an indicator that more space needs to be allocated to RocksDB/WAL.
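
For example, the counter can be checked directly on the OSD node (assuming jq is installed):

ceph daemon osd.104 perf dump | jq '.bluefs.slow_used_bytes'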

Starting with the Ceph Nautilus release, spillover is shown in the output of the ceph health command.

The process of allocating more space depends on how the OSD was deployed. If it was deployed by a version prior to SUSE Enterprise Storage 6, the OSD will need to be re-deployed. If it was deployed with version 6 or later, it may be possible to expand the logical volume that the RocksDB and WAL data reside on, subject to available space.
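
A minimal sketch of such an expansion, assuming the logical volume name and OSD ID shown are placeholders and that free space is available in the volume group:

  # stop the OSD, grow the logical volume backing the DB/WAL, then expand BlueFS
  systemctl stop ceph-osd@104
  lvextend -L +20G /dev/vg_nvme/lv_osd104_db
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-104
  systemctl start ceph-osd@104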

6.3.6 BlueStore Parameters

Ceph thin provisions its data files, including the Write-Ahead Log (WAL) files. Pre-extending the WAL files saves time by not having to engage the allocator during writes, and it potentially reduces the likelihood of WAL file fragmentation. This likely only provides a benefit during the early life of the cluster.

bluefs_preextend_wal_files=1

BlueStore has the ability to perform buffered writes. Buffered writes enable populating the read cache during the write process. This setting, in effect, changes the BlueStore cache into a write-through cache.

bluestore_default_buffered_write = true

To prevent writes to the WAL when using a fast device, such as SSD and NVMe, set:

prefer_deferred_size_ssd=0 (pre-deployment)

6.3.7 BlueStore Allocation Size

Important

The following settings are not necessary for fresh deployments. They apply only to upgraded deployments or early SUSE Enterprise Storage 6 deployments, which may still benefit from them.

The following settings have been shown to slightly improve small-object write performance under mixed workload conditions. Reducing the alloc_size to 4 kB helps reduce write amplification for small objects and for erasure-coded pools of smaller objects. This change needs to be made before OSD deployment. If made after the fact, the OSDs will need to be re-deployed for it to take effect.

It is advised that spinning media continue to use 64 kB, while SSD/NVMe devices are likely to benefit from a setting of 4 kB.

min_alloc_size_ssd=4096
min_alloc_size_hdd=65536
Warning

Setting the alloc_size_ssd to 64 kB may reduce the maximum throughput capability of the OSD.

6.3.8 BlueStore Cache

Increasing the BlueStore cache size can improve performance with many workloads. While FileStore OSDs cache data in the kernel's page cache, BlueStore OSDs cache data within the memory allocated by the OSD daemon itself. The OSD daemon allocates memory up to its memory target (as controlled by the osd_memory_target parameter), and this determines the potential size of the BlueStore cache. The BlueStore cache is a read cache that, by default, is populated when objects are read. Setting the cache's minimum size higher than the default guarantees that the specified value is the minimum cache available for each OSD, increasing the chance of cache hits even for less frequently accessed objects.

The default osd_memory_target value is 4 GB; that is, each OSD daemon running on a node can be expected to consume roughly that much memory. If a node's total RAM is significantly higher than the number of OSDs × 4 GB and no other daemons are running on the node, performance can be increased by raising the value of osd_memory_target. This should be done with care to ensure that the operating system still has enough memory for its needs, while leaving a safety margin.

If you want to ensure that the BlueStore cache will not fall below a certain minimum, use the osd_memory_cache_min parameter. Here is an example (the values are expressed in bytes):

osd_memory_target = 6442450944
osd_memory_cache_min = 4294967296
Tip

As a best practice, start with the full memory of the node. Deduct 16 GB or 32 GB for the operating system, then deduct appropriate amounts for any other workloads running on the node (for example, the MDS cache if an MDS is colocated). Divide the remainder by the number of OSDs on that host, and leave some headroom. For example:

(256 GB - 32 GB) / 20 OSDs = 11.2 GB/OSD (max)

Using this example, configure approximately 8 to 10 GB per OSD.

By default, BlueStore automatically tunes cache ratios between data and key-value data. In some cases it may be helpful to manually tune the ratios or even increase the cache size. There are several relevant counters for the cache:

  • bluestore_onode_hits

  • bluestore_onode_misses

  • bluestore_onode_shard_hits

  • bluestore_onode_shard_misses

If the misses are high, it is worth experimenting with increasing the cache settings or adjusting the ratios.
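
These counters are found in the bluestore section of the daemon perf dump output. For example (assuming jq is installed):

ceph daemon osd.104 perf dump | jq '.bluestore | {bluestore_onode_hits, bluestore_onode_misses}'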

Adjusting the BlueStore cache size above the default has the potential to improve performance of small-block workloads. This can be done globally by adjusting the bluestore_cache_size value. By default, the cluster utilizes different values for HDD and SSD/NVMe devices. The best practice is to increase the cache setting for the specific media type you are interested in tuning:

  • bluestore_cache_size_hdd (default 1073741824 - 1 GB)

  • bluestore_cache_size_ssd (default 3221225472 - 3 GB)

Note

If the cache size parameters are adjusted while cache autotuning is in use, osd_memory_target should be adjusted to accommodate the OSD base RAM and the cache allocation.

In some cases, manually tuning the cache allocation percentages may improve performance. This is achieved by disabling autotuning of the cache with this configuration line:

bluestore_cache_autotune=0

Disabling autotuning means that the osd_memory_cache_min value no longer has any effect.

The cache allocations are modified by adjusting the following:

  • bluestore_cache_kv_ratio (default 0.4)

  • bluestore_cache_meta_ratio (default 0.4)

Any unspecified portion is used for caching the objects themselves.
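
As a sketch, a manually tuned OSD cache configuration might look like the following. The values shown are examples only and should be validated against your workload:

  bluestore_cache_autotune = 0
  bluestore_cache_size_ssd = 6442450944
  bluestore_cache_kv_ratio = 0.2
  bluestore_cache_meta_ratio = 0.5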

6.4 RBD

6.4.1 RBD Cluster

As RBD is a native protocol, the tuning is directly related to OSD or general Ceph core options that are covered in previous sections.

6.4.2 RBD Client

The read-ahead cache defaults to 512 kB; test by tuning it up and down on the client nodes.

echo {size in kB} > /sys/block/rbd0/queue/read_ahead_kb

If your workload performs large sequential reads such as backup and restore, then this can make a significant difference in restore performance.
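
The value written to sysfs does not persist across reboots or image re-mapping. One way to make it persistent is a udev rule such as the following sketch (the file name and the 4 MiB value are examples):

  # /etc/udev/rules.d/80-rbd-readahead.rules
  KERNEL=="rbd*", ACTION=="add", ATTR{queue/read_ahead_kb}="4096"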

6.5 CephFS

Most of the performance tuning covered in this section pertains to the CephFS Metadata Servers. Because CephFS is a native protocol, much of the performance tuning is handled at the operating system, OSD, and BlueStore layers. Because it is a file system mounted by clients, there are also some client options, which are covered in the client section below.

6.5.1 MDS Tuning

In filesystems with millions of files, there is some advantage to utilizing very low-latency media, such as NVMe, for the CephFS metadata pool.

Utilizing the ceph daemon perf dump command, a significant amount of data can be examined for the Ceph Metadata Servers. Note that the MDS perf counters only apply to metadata operations; the actual I/O path goes from the clients straight to the OSDs.
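
For example, the metadata-related sections can be extracted from a running MDS as follows (a sketch; mds1 is a placeholder for your MDS name and jq is assumed to be installed):

ceph daemon mds.mds1 perf dump | jq '.mds, .mds_mem, .mds_server'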

CephFS supports multiple metadata servers. These servers can operate in a multiple-active mode to load balance metadata operation requests. To identify whether the MDS infrastructure is under-performing, examine the MDS data for request count and reply latencies. This should be done during an idle period on the cluster to form a baseline, and then compared with data collected under load. If the average reply latency climbs too high, the MDS needs to be examined further to identify whether the number of active metadata servers should be increased, or whether simply increasing the metadata server cache may be sufficient. A sample of the general MDS output for request count and reply latency is shown below:

  "mds": {
    # request count, interesting to get a sense of MDS load
           "request": 0,
           "reply": 0,
    # reply and the latencies of replies can point to load issues
           "reply_latency": {
               "avgcount": 0,
               "sum": 0.000000000,
               "avgtime": 0.000000000
           }
          }

Examining the mds_mem section of the output can help with understanding how the cache is utilized. High inode counters can indicate that a large number of files are open concurrently. This generally indicates that more memory may need to be provided to the MDS. If MDS memory cannot be increased, additional active MDS daemons should be deployed.

  "mds_mem": {

           "ino": 13,
           "ino+": 13,
           "ino-": 0,
           "dir": 12,
           "dir+": 12,
           "dir-": 0,
           "dn": 10,
           "dn+": 10,
           "dn-": 0,

A high cap count can indicate misbehaving clients, for example clients that do not hand back capabilities. This may indicate that some clients need to be upgraded to a more recent version, or that a client needs to be investigated for possible issues.

  "cap": 0,
  "cap+": 0,
  "cap-": 0,

This final section shows memory utilization. The RSS value is the current memory size used. If this is roughly equal to the mds_cache_memory_limit, the MDS could probably use more memory.

  "rss": 41524,
  "heap": 314072
},

Another important aspect of tuning a distributed file system is recognizing problematic workloads. The output values below provide some insight into what the MDS daemon is spending its time on. Each heading has the same three attributes as req_create_latency. With this information, it may be possible to better tune the workloads.

  "mds_server": {
           "dispatch_client_request": 0,
           "dispatch_server_request": 0,
           "handle_client_request": 0,
           "handle_client_session": 0,
           "handle_slave_request": 0,
           "req_create_latency": {
               "avgcount": 0,
               "sum": 0.000000000,
               "avgtime": 0.000000000
           },
           "req_getattr_latency": {},
           "req_getfilelock_latency": {},
           "req_link_latency": {},
           "req_lookup_latency": {},
           "req_lookuphash_latency": {},
           "req_lookupino_latency": {},
           "req_lookupname_latency": {},
           "req_lookupparent_latency": {},
           "req_lookupsnap_latency": {},
           "req_lssnap_latency": {},
           "req_mkdir_latency": {},
           "req_mknod_latency": {},
           "req_mksnap_latency": {},
           "req_open_latency": {},
           "req_readdir_latency": {},
           "req_rename_latency": {},
           "req_renamesnap_latency": {},
           "req_rmdir_latency": {},
           "req_rmsnap_latency": {},
           "req_rmxattr_latency": {},
           "req_setattr_latency": {},
           "req_setdirlayout_latency": {},
           "req_setfilelock_latency": {},
           "req_setlayout_latency": {},
           "req_setxattr_latency": {},
           "req_symlink_latency": {},
           "req_unlink_latency": {},
       }

Tuning the metadata server cache allows for more metadata operations to come from RAM, resulting in improved performance. The example below sets the cache to 16 GB.

mds_cache_memory_limit=17179869184
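
On a Ceph Nautilus based cluster, the same limit can also be applied at runtime through the centralized configuration database, for example:

ceph config set mds mds_cache_memory_limit 17179869184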

6.5.2 CephFS - Client

From the client side, there are a number of performance affecting mount options that can be employed. It is important to understand the potential impact on the applications being utilized before employing these options.

The following mount options may be adjusted to improve performance, but we recommend that their impact is clearly understood prior to implementation in a production environment.

noacl

Setting this mount option disables POSIX Access Control Lists for the CephFS mount, lowering potential metadata overhead.

noatime

This option prevents the access time metadata for files from being updated.

nodiratime

Setting this option prevents the metadata for access time of a directory from being updated.

nocrc

This disables CephFS CRCs, relying instead on TCP checksums to verify data correctness.

rasize

Setting a larger read-ahead for the mount may increase performance for large sequential operations. The default is 8 MiB.
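
As an illustration only, a kernel-client mount combining several of these options might look like the following (the monitor address, secret file, and read-ahead value are placeholders):

  mount -t ceph mon1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,noatime,nodiratime,rasize=16777216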

6.6 RGW

There are a large number of tunables for the RADOS Gateway (RGW). These may be specific to the types of workloads being handled by the gateway, and it may make sense to have different gateways handle distinctly different workloads.

6.6.1 Sharding

The ideal situation is to understand how many total objects a bucket will host, as this allows the bucket to be created with an appropriate number of shards at the outset. To gather information on bucket sharding, issue:

radosgw-admin bucket limit check

The output of this command is similar to the following:

  "user_id": "myusername",
          "buckets": [
              {
                  "bucket": "mybucketname",
                  "tenant": "",
                  "num_objects": 611493,
                  "num_shards": 50,
                  "objects_per_shard": 12229,
                  "fill_status": "OK"
              }
          ]

By default, Ceph reshards buckets to try to maintain reasonable performance. If it is known ahead of time how many shards a bucket may need, based on a ratio of one shard per 100 000 objects, the bucket may be pre-sharded. This reduces the contention and potential latency issues that arise when resharding occurs. To pre-shard the bucket, create it and then submit it for sharding with the radosgw-admin command. For example:

radosgw-admin bucket reshard --bucket={bucket name} --num-shards={prime number}

Where num-shards is a prime number. Each shard should represent approximately 100 000 objects.
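
As a worked example, a bucket expected to hold roughly two million objects needs about 20 shards (2 000 000 / 100 000); rounding up to the next prime gives 23. The bucket name below is a placeholder:

radosgw-admin bucket reshard --bucket=mybucket --num-shards=23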

6.6.2 Limiting Bucket Listing Results

If a process relies on listing the buckets frequently to iterate through results, yet only uses a small number of results for each iteration, it is useful to set the rgw_max_listing_results parameter.
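
For example, a process that only consumes the first few entries of each listing might use a setting such as the following in the RGW section (the value is an example only):

rgw_max_listing_results = 100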

6.6.3 Parallel I/O Requests

By default, the Object Gateway process is limited to eight simultaneous I/O operations for the index. This can be adjusted with the rgw_bucket_index_max_aio parameter.

6.6.4 Window Size

When working with larger objects, increasing the size of the Object Gateway windows for put and get can help with performance. Modify the following values in the Object Gateway section of the configuration:

rgw put obj min window size = [size in bytes, 16 MiB default]
rgw get obj min window size = [size in bytes, 16 MiB default]

6.6.5 Nagle's Algorithm

Nagle's algorithm was introduced to maximize the use of network buffers by reducing the number of small packets transmitted over the network. While this is helpful in lower-bandwidth environments, it can represent a performance degradation in high-bandwidth environments. Disabling it on RGW nodes can improve performance. To do so, include the following in the RGW section of the Ceph configuration:

tcp_nodelay=1

6.7 Administrative and Usage Choices

6.7.1 Data Protection Schemes

The default replication setting keeps three total copies of every object written. This provides a high level of data protection by allowing up to two devices or nodes to fail while still protecting the data.

There are use cases where protecting the data is not important, but where performance is. In these cases, such as HPC scratch storage, it may be worthwhile to lower the replication count. This can be achieved by issuing a command such as:

ceph osd pool set rbd size 2
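
The change can be verified with:

ceph osd pool get rbd size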

6.7.2 Erasure Coding

When using erasure coding, it is best to utilize optimized coding pool sizes. Experimental data suggests that the optimal profiles have either four or eight data chunks. It is also important to map this to your failure domain model. If your cluster failure domain is at the node level, you need at least k+m nodes. Similarly, if your failure domain is at the rack level, your cluster needs to be spread over k+m racks. The key point is that the distribution of the data in relation to the failure domain must be taken into account.
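
A minimal sketch of creating such a profile and an associated pool, assuming a node-level failure domain and placeholder names (k=4, m=2 requires at least six nodes):

  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create ecpool 64 64 erasure ec-4-2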

When using erasure coding schemes with failure domains larger than a single node, the use of Local Reconstruction Codes (LRC) may be beneficial due to lower utilization of the network backbone, especially during failure and recovery scenarios.

There are particular use cases where erasure coding may even increase performance. These are mostly limited to large block (1 MB+) sequential read/write workloads. This is due to the parallelization of I/O requests that occurs when splitting objects into chunks to write to multiple OSDs.
