12 Advanced configuration #
12.1 Performing advanced configuration tasks #
These examples show how to perform advanced configuration tasks on your Rook storage cluster.
12.1.1 Prerequisites #
Most of the examples make use of the ceph client command. A quick way to use the Ceph client suite is from a Rook Toolbox container.
The Kubernetes-based examples assume Rook OSD pods are in the rook-ceph namespace. If you run them in a different namespace, modify kubectl -n rook-ceph [...] to fit your situation.
12.1.2 Using custom Ceph user and secret for mounting #
For more information about creating Ceph users, refer to Section 30.2.2, “Managing users”.
Using a custom Ceph user and secret key can be done for both file system and block storage.
Create a custom user in Ceph with read-write access in the /bar directory on CephFS (for Ceph Mimic or newer, use data=POOL_NAME instead of pool=POOL_NAME):
cephuser@adm > ceph auth get-or-create-key client.user1 mon \
 'allow r' osd 'allow rw tag cephfs pool=YOUR_FS_DATA_POOL' \
 mds 'allow r, allow rw path=/bar'
The command will return a Ceph secret key. This key should be added as a secret in Kubernetes like this:
kubectl@adm > kubectl create secret generic ceph-user1-secret --from-literal=key=YOUR_CEPH_KEY
This secret key must be created with the same name in each namespace where the StorageClass will be used.
In addition to this secret key, you must create a RoleBinding to allow the Rook Ceph agent to get the secret from each namespace. The RoleBinding is optional if you are using a ClusterRoleBinding for the Rook Ceph agent secret-key access. A ClusterRole containing the permissions needed for the bindings is shown as an example after the next step.
In the StorageClass parameters, set the following options:
mountUser: user1
mountSecret: ceph-user1-secret
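For illustration, here is a minimal sketch of a complete StorageClass using these options. It assumes the legacy Rook flex-volume block provisioner (ceph.rook.io/block) and an existing pool named replicapool; adjust both to your deployment:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  # Hypothetical name; not taken from this guide
  name: rook-ceph-block-user1
provisioner: ceph.rook.io/block
parameters:
  blockPool: replicapool
  clusterNamespace: rook-ceph
  fstype: ext4
  # The custom Ceph user and the Kubernetes secret created above
  mountUser: user1
  mountSecret: ceph-user1-secret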
If you want the Rook-Ceph agent to require a mountUser and mountSecret to be set in StorageClasses using Rook, you need to set the environment variable AGENT_MOUNT_SECURITY_MODE to Restricted on the Rook-Ceph Operator deployment.
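As a sketch, assuming the operator deployment carries the default name rook-ceph-operator, the environment variable could be set as follows:
kubectl@adm > kubectl -n rook-ceph set env deployment/rook-ceph-operator \
 AGENT_MOUNT_SECURITY_MODE=Restricted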
For more information on using the Ceph feature to limit access to CephFS paths, see http://docs.ceph.com/docs/mimic/cephfs/client-auth/#path-restriction.
12.1.2.1 Creating the ClusterRole #
When you are using the Helm chart to install the Rook-Ceph Operator, and have set mountSecurityMode to, for example, Restricted, then the ClusterRole below has already been created for you.
This ClusterRole is needed no matter whether you want to use one RoleBinding per namespace or a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rook-ceph-agent-mount
  labels:
    operator: rook
    storage-backend: ceph
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
12.1.2.2 Creating the RoleBinding #
You either need a RoleBinding in each namespace in which a mount secret resides, or you can create a ClusterRoleBinding with which the Rook Ceph agent has access to Kubernetes secrets in all namespaces.
Create the RoleBinding shown here in each namespace for which the Rook Ceph agent should read secrets for mounting. The RoleBinding subjects' namespace must be the one the Rook-Ceph agent runs in (default rook-ceph for version 1.0 and newer; for previous versions, the default namespace was rook-ceph-system).
Replace namespace: name-of-namespace-with-mountsecret with the name of each namespace a mountSecret can be in.
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rook-ceph-agent-mount
  namespace: name-of-namespace-with-mountsecret
  labels:
    operator: rook
    storage-backend: ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-agent-mount
subjects:
- kind: ServiceAccount
  name: rook-ceph-system
  namespace: rook-ceph
12.1.2.3 Creating the ClusterRoleBinding #
This ClusterRoleBinding only needs to be created once, as it covers the whole cluster.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rook-ceph-agent-mount
  labels:
    operator: rook
    storage-backend: ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-agent-mount
subjects:
- kind: ServiceAccount
  name: rook-ceph-system
  namespace: rook-ceph
12.1.3 Collecting logs #
All Rook logs can be collected in a Kubernetes environment with the following command:
for p in $(kubectl -n rook-ceph get pods -o jsonpath='{.items[*].metadata.name}')
do
    for c in $(kubectl -n rook-ceph get pod ${p} -o jsonpath='{.spec.containers[*].name}')
    do
        echo "BEGIN logs from pod: ${p} ${c}"
        kubectl -n rook-ceph logs -c ${c} ${p}
        echo "END logs from pod: ${p} ${c}"
    done
done
This gets the logs for every container in every Rook pod. To compress them into a .gz archive for easy sharing, pipe the output to gzip; alternatively, you can pipe to less or redirect to a single text file (see the sketch below).
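For example, the following hedged variant collects all container logs into one compressed archive; the --all-containers shortcut and the output path are illustrative, not taken from this guide:
for p in $(kubectl -n rook-ceph get pods -o jsonpath='{.items[*].metadata.name}')
do
    echo "BEGIN logs from pod: ${p}"
    kubectl -n rook-ceph logs --all-containers ${p}
    echo "END logs from pod: ${p}"
done | gzip > /tmp/rook-logs.gz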
12.1.4 OSD information #
Keeping track of OSDs and their underlying storage devices can be difficult. The following scripts will clear things up quickly.
12.1.4.1 Kubernetes #
# Get OSD Pods
# This uses the example/default cluster name "rook-ceph"
OSD_PODS=$(kubectl get pods --all-namespaces -l \
  app=rook-ceph-osd,rook_cluster=rook-ceph -o jsonpath='{.items[*].metadata.name}')

# Find node and drive associations from OSD pods
for pod in $(echo ${OSD_PODS})
do
  echo "Pod:  ${pod}"
  echo "Node: $(kubectl -n rook-ceph get pod ${pod} -o jsonpath='{.spec.nodeName}')"
  kubectl -n rook-ceph exec ${pod} -- sh -c '\
    for i in /var/lib/ceph/osd/ceph-*; do
      [ -f ${i}/ready ] || continue
      echo -ne "-$(basename ${i}) "
      echo $(lsblk -n -o NAME,SIZE ${i}/block 2> /dev/null || \
        findmnt -n -v -o SOURCE,SIZE -T ${i}) $(cat ${i}/type)
    done | sort -V
    echo'
done
The output should look as follows:
Pod:  osd-m2fz2
Node: node1.zbrbdl
-osd0  sda3  557.3G  bluestore
-osd1  sdf3  110.2G  bluestore
-osd2  sdd3  277.8G  bluestore
-osd3  sdb3  557.3G  bluestore
-osd4  sde3  464.2G  bluestore
-osd5  sdc3  557.3G  bluestore

Pod:  osd-nxxnq
Node: node3.zbrbdl
-osd6   sda3  110.7G  bluestore
-osd17  sdd3  1.8T    bluestore
-osd18  sdb3  231.8G  bluestore
-osd19  sdc3  231.8G  bluestore

Pod:  osd-tww1h
Node: node2.zbrbdl
-osd7   sdc3  464.2G  bluestore
-osd8   sdj3  557.3G  bluestore
-osd9   sdf3   66.7G  bluestore
-osd10  sdd3  464.2G  bluestore
-osd11  sdb3  147.4G  bluestore
-osd12  sdi3  557.3G  bluestore
-osd13  sdk3  557.3G  bluestore
-osd14  sde3   66.7G  bluestore
-osd15  sda3  110.2G  bluestore
-osd16  sdh3  135.1G  bluestore
12.1.5 Separate storage groups #
Instead of setting this up manually, the deviceClass property can be used on pool structures in CephBlockPool, CephFilesystem and CephObjectStore CRD objects.
By default Rook-Ceph puts all storage under one replication rule in the CRUSH Map which provides the maximum amount of storage capacity for a cluster. If you would like to use different storage endpoints for different purposes, you need to create separate storage groups.
In the following example we will separate SSD drives from spindle-based drives, a common practice for those looking to target certain workloads onto faster (database) or slower (file archive) storage.
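As an illustration, here is a minimal sketch of a CephBlockPool restricted to SSD-backed OSDs via deviceClass; the pool name is hypothetical:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  # Hypothetical name
  name: ssd-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  # Place this pool only on OSDs with the "ssd" CRUSH device class
  deviceClass: ssd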
12.1.6 Configure pools #
12.1.6.1 Sizing placement groups #
Since Ceph Nautilus (v14.x), you can use the Ceph Manager pg_autoscaler module to auto-scale the PGs as needed. If you want to enable this feature, refer to Section 8.1.1.1, “Default PG and PGP counts”.
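For reference, a brief sketch of enabling the autoscaler on the rbd pool; see the referenced section for the full procedure:
cephuser@adm > ceph mgr module enable pg_autoscaler
cephuser@adm > ceph osd pool set rbd pg_autoscale_mode on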
The general rules for deciding how many PGs your pool(s) should contain are:
Less than 5 OSDs: set pg_num to 128.
Between 5 and 10 OSDs: set pg_num to 512.
Between 10 and 50 OSDs: set pg_num to 1024.
If you have more than 50 OSDs, you need to calculate the pg_num value yourself. To do so, make use of the pgcalc tool at http://ceph.com/pgcalc/.
If you are already using a pool, it is generally safe to set the PG count on the fly (see Section 12.1.6.2, “Setting PG count”). Decreasing the PG count is not recommended on a pool that is in use. The safest way to decrease the PG count is to back up the data, delete the pool, and recreate it.
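Before changing the value, you can check the current PG count of a pool, for example on the rbd pool used in the next section:
cephuser@adm > ceph osd pool get rbd pg_num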
12.1.6.2 Setting PG count #
Be sure to read Section 12.1.6.1, “Sizing placement groups” before changing the number of PGs.
# Set the number of PGs in the rbd pool to 512
cephuser@adm > ceph osd pool set rbd pg_num 512
12.1.7 Creating custom ceph.conf settings #
The advised method for controlling Ceph configuration is to manually use the Ceph CLI or the Ceph Dashboard, because this offers the most flexibility. We recommend using the override ConfigMap only when absolutely necessary, and resetting its config to an empty string if or when the configurations are no longer necessary. Configurations in the config file will make the Ceph cluster less configurable from the CLI and Ceph Dashboard, and may make future tuning or debugging difficult.
Setting configs via Ceph's CLI requires that at least one MON is available for the configs to be set, and setting configs via the Ceph Dashboard requires at least one MGR to be available. Ceph may also have a small number of very advanced settings that cannot easily be modified via the CLI or Ceph Dashboard. In order to set configurations before MONs are available, or to set problematic configuration settings, the rook-config-override ConfigMap exists, and the config field can be set with the contents of a ceph.conf file. The contents will be propagated to all MON, MGR, OSD, MDS, and RGW daemons as an /etc/ceph/ceph.conf file.
Rook performs no validation on the config, so the validity of the settings is the user's responsibility.
If the rook-config-override ConfigMap is created before the cluster is started, the Ceph daemons will automatically pick up the settings. If you add the settings to the ConfigMap after the cluster has been initialized, each daemon will need to be restarted where you want the settings applied:
MONs: ensure all three MONs are online and healthy before restarting each mon pod, one at a time.
MGRs: the pods are stateless and can be restarted as needed, but note that this will disrupt the Ceph dashboard during restart.
OSDs: restart the pods by deleting them, one at a time, and running ceph -s between each restart to ensure the cluster goes back to the “active/clean” state (see the sketch after this list).
RGW: the pods are stateless and can be restarted as needed.
MDS: the pods are stateless and can be restarted as needed.
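Here is an illustrative sketch of the OSD restart step. The ceph-osd-id pod label is an assumption about your Rook version; verify it against your deployment before relying on it:
# Delete one OSD pod; its deployment recreates it
kubectl@adm > kubectl -n rook-ceph delete pod -l app=rook-ceph-osd,ceph-osd-id=0
# Confirm the cluster returns to "active/clean" before restarting the next OSD
cephuser@adm > ceph -s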
After the pods restart, the new settings should be in effect. Note that if the ConfigMap in the Ceph cluster's namespace is created before the cluster is created, the daemons will pick up the settings at first launch.
12.1.7.1 Custom ceph.conf example #
In this example, we will set the default pool size to two, and tell OSD daemons not to change the weight of OSDs on startup.
Modify Ceph settings carefully. You are leaving the sandbox tested by Rook. Changing the settings could result in unhealthy daemons or even data loss if used incorrectly.
When the Rook Operator creates a cluster, a placeholder ConfigMap is created that will allow you to override Ceph configuration settings. When the daemon pods are started, the settings specified in this ConfigMap will be merged with the default settings generated by Rook.
The default override settings are blank. Cutting out the extraneous properties, we would see the following defaults after creating a cluster:
kubectl@adm > kubectl -n rook-ceph get ConfigMap rook-config-override -o yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: ""
To apply your desired configuration, you will need to update this ConfigMap. The next time the daemon pod(s) start, they will use the updated configs.
kubectl@adm > kubectl -n rook-ceph edit configmap rook-config-override
Modify the settings and save. Each line you add should be indented from the config property, as such:
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    osd crush update on start = false
    osd pool default size = 2
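After restarting the affected daemons, you can check whether a setting is active. As a hedged example (this requires a running MGR, and uses the option set above):
cephuser@adm > ceph config show osd.0 | grep osd_pool_default_size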
12.1.8 OSD CRUSH settings #
A useful view of the CRUSH Map (see Chapter 17, “Stored data management” for more details) is generated with the following command:
cephuser@adm > ceph osd tree
In this section we will be tweaking some of the values seen in the output.
12.1.8.1 OSD weight #
The CRUSH weight controls the ratio of data that should be distributed to each OSD. This also means a higher or lower amount of disk I/O operations for an OSD with higher or lower weight, respectively.
By default, OSDs get a weight relative to their storage capacity, which maximizes overall cluster capacity by filling all drives at the same rate, even if drive sizes vary. This should work for most use-cases, but the following situations could warrant weight changes:
Your cluster has some relatively slow OSDs or nodes. Lowering their weight can reduce the impact of this bottleneck.
You are using BlueStore drives provisioned with Rook v0.3.1 or older. In this case, you may notice OSD weights did not get set relative to their storage capacity. Changing the weight can fix this and maximize cluster capacity.
This example sets the weight of osd.0, which is 600 GiB:
cephuser@adm > ceph osd crush reweight osd.0 .600
12.1.8.2 OSD primary affinity #
When pools are set with a size setting greater than one, data is replicated between nodes and OSDs. For every chunk of data a Primary OSD is selected to be used for reading that data to be sent to clients. You can control how likely it is for an OSD to become a Primary using the Primary Affinity setting. This is similar to the OSD weight setting, except it only affects reads on the storage device, not capacity or writes.
In this example, we will make sure osd.0 is only selected as Primary if all other OSDs holding replica data are unavailable:
cephuser@adm > ceph osd primary-affinity osd.0 0
12.1.9 Removing phantom OSD #
If you have OSDs which are not showing any disks, you can remove those “Phantom OSDs” by following the instructions below. To check for “Phantom OSDs”, you can run:
cephuser@adm > ceph osd tree
An example output looks like this:
ID  CLASS WEIGHT   TYPE NAME                  STATUS REWEIGHT PRI-AFF
-1        57.38062 root default
-13        7.17258     host node1.example.com
 2    hdd  3.61859         osd.2                  up  1.00000 1.00000
-7               0     host node2.example.com  down        0 1.00000
The host node2.example.com in the output has no disks, so it is most likely a “Phantom OSD”. Now to remove it, use the ID in the first column of the output and replace ID with it. In the example output above, the ID would be -7. The commands are:
cephuser@adm > ceph osd out ID
cephuser@adm > ceph osd crush remove osd.ID
cephuser@adm > ceph auth del osd.ID
cephuser@adm > ceph osd rm ID
To recheck that the phantom OSD was removed, re-run the following command and check if the OSD with the ID does not show up anymore:
cephuser@adm > ceph osd tree
12.1.10 Changing the failure domain #
In Rook, it is now possible to indicate how the default CRUSH failure domain rule must be configured in order to ensure that replicas or erasure code shards are separated across hosts, and a single host failure does not affect availability. For instance, this is an example manifest of a block pool named replicapool configured with a failureDomain set to osd:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook
spec:
  # The failure domain will spread the replicas of the data across different failure zones
  failureDomain: osd
  [...]
However, for several reasons, we may need to change such a failure domain to its other value: host. Unfortunately, changing it directly in the YAML manifest is not currently handled by Rook, so we need to perform the change directly using Ceph commands from the Rook toolbox pod, for instance:
cephuser@adm > ceph osd pool get replicapool crush_rule
crush_rule: replicapool
cephuser@adm > ceph osd crush rule create-replicated replicapool_host_rule default host
Notice that the suffix host_rule in the name of the rule is just for clarity about the type of rule we are creating here, and can be anything else as long as it is different from the existing one. Once the new rule has been created, we simply apply it to our block pool:
cephuser@adm > ceph osd pool set replicapool crush_rule replicapool_host_rule
Then validate that it has actually been applied properly:
cephuser@adm > ceph osd pool get replicapool crush_rule
crush_rule: replicapool_host_rule
If the cluster's health was HEALTH_OK when we performed this change, the new rule is applied to the cluster immediately and transparently, without service disruption.
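As a simple check, you can confirm the cluster's health before performing the change:
cephuser@adm > ceph health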
Exactly the same approach can be used to change from host back to osd.