12 Tuning I/O Performance #
I/O scheduling controls how input/output operations are submitted to storage. SUSE Linux Enterprise Server offers various I/O algorithms, called elevators, suited to different workloads. Elevators can help to reduce seek operations and can prioritize I/O requests.
Choosing the best-suited I/O elevator depends not only on the workload but also on the hardware. Single ATA disk systems, SSDs, RAID arrays, or network storage systems, for example, each require different tuning strategies.
12.1 Switching I/O Scheduling #
SUSE Linux Enterprise Server picks a default I/O scheduler at boot time, which can be changed on the fly per block device. This makes it possible to set different algorithms, for example, for the device hosting the system partition and the device hosting a database.
The default I/O scheduler is chosen for each device based on whether the device reports itself as a rotational disk. For non-rotational disks, the DEADLINE I/O scheduler is picked. Other devices default to CFQ (Completely Fair Queuing).
To change this default, use the following boot parameter:
elevator=SCHEDULER
Replace SCHEDULER with one of the values cfq, noop, or deadline. See Section 12.2, “Available I/O Elevators” for details.
To change the elevator for a specific device in the running system, run the following command:
echo SCHEDULER > /sys/block/DEVICE/queue/scheduler
Here, SCHEDULER is one of cfq, noop, or deadline. DEVICE is the block device (sda for example). Note that this change does not persist across reboots. To change the I/O scheduler for a particular device permanently, either place the command that switches the I/O scheduler into init scripts or add an appropriate udev rule to /lib/udev/rules.d/. See /lib/udev/rules.d/60-ssd-scheduler.rules for an example of such tuning.
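As a rough illustration of the udev approach described above, the following sketch prints a rule that selects an elevator for non-rotational disks. The function name print_scheduler_rule and the device match sd[a-z] are assumptions made for this example; they are not taken from the shipped 60-ssd-scheduler.rules file, which should be consulted for the actual rule.

```shell
# Print a udev rule that selects an elevator for non-rotational disks.
# The KERNEL match "sd[a-z]" is an illustrative assumption; adjust it
# to the devices present on your system.
print_scheduler_rule() {
    # $1: elevator name, for example "deadline"
    cat <<EOF
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="$1"
EOF
}

# Review the generated rule before saving it as a rules file.
print_scheduler_rule deadline
```

Printing the rule instead of writing it directly makes it easy to inspect the result before placing it in a rules directory.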
On IBM z Systems, the default I/O scheduler for a storage device is set by the device driver.
The elevator boot parameter does not apply to devices using the blk-mq I/O path (refer to Section 12.5, “Enable blk-mq I/O Path for SCSI by Default”). With blk-mq, the default elevator is mq-deadline for conventional devices (for example, regular hard disks, SSDs) and none for faster storage devices (devices with multiple hardware queues).
It is possible to change the elevator for a specific device in a running system. SCHEDULER should be set to either mq-deadline, kyber, bfq, or none.
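Because the accepted elevator names differ between the legacy and the blk-mq I/O path, it can help to check that a name is actually offered by the device before writing it. The following is a minimal sketch under that assumption; the function name set_scheduler is hypothetical, and the file argument is expected to be a scheduler file such as /sys/block/sda/queue/scheduler.

```shell
# set_scheduler FILE SCHEDULER
# Write SCHEDULER to FILE only if FILE lists it as available.
set_scheduler() {
    sched_file=$1
    wanted=$2
    # Remove the brackets that mark the active elevator.
    available=$(sed 's/[][]//g' "$sched_file")
    for name in $available; do
        if [ "$name" = "$wanted" ]; then
            echo "$wanted" > "$sched_file"
            return 0
        fi
    done
    echo "elevator '$wanted' is not offered by $sched_file (available: $available)" >&2
    return 1
}

# Example (requires root privileges):
# set_scheduler /sys/block/sda/queue/scheduler kyber
```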
12.2 Available I/O Elevators #
Below is a list of elevators available on SUSE Linux Enterprise Server for devices that use the legacy block I/O path. If an elevator has tunable parameters, they can be set with the command:
echo VALUE > /sys/block/DEVICE/queue/iosched/TUNABLE
where VALUE is the desired value for the TUNABLE and DEVICE the block device.
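Before changing a tunable, it is often useful to see which tunables the active elevator exposes and what their current values are. The sketch below lists every file in an iosched directory together with its content; the function name list_tunables is an assumption for this example.

```shell
# list_tunables DIR
# Print "name = value" for each tunable file in DIR,
# for example /sys/block/sda/queue/iosched.
list_tunables() {
    for tunable in "$1"/*; do
        # Skip the unexpanded pattern if DIR is empty or does not exist.
        [ -f "$tunable" ] || continue
        printf '%s = %s\n' "$(basename "$tunable")" "$(cat "$tunable")"
    done
}

# Example:
# list_tunables /sys/block/sda/queue/iosched
```

The set of files shown depends on the elevator currently selected for the device.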
To find out what elevators are available for a device (sda for example), run the following command (the currently selected scheduler is listed in brackets):
tux > cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
If this file contains different strings it usually means that the device uses blk-mq I/O path (refer to Section 12.5, “Enable blk-mq I/O Path for SCSI by Default” and Section 12.3, “Available I/O Elevators with blk-mq I/O path”).
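In scripts, the bracketed entry can be extracted to find the active elevator. The following sketch assumes the file format shown above; the function name active_scheduler is hypothetical. It prints nothing if the input contains no bracketed entry.

```shell
# active_scheduler LINE
# Print the elevator name enclosed in square brackets in LINE.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Example:
# active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```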
12.2.1 CFQ (Completely Fair Queuing) #
CFQ is a fairness-oriented scheduler and is used by default on SUSE Linux Enterprise Server. The algorithm assigns each thread a time slice in which it is allowed to submit I/O to disk. This way each thread gets a fair share of I/O throughput. It also allows assigning tasks I/O priorities which are taken into account during scheduling decisions (see Section 8.3.3, “Prioritizing Disk Access with ionice”). The CFQ scheduler has the following tunable parameters:
CFQ tunable parameters #
File | Description |
---|---|
| When a task has no more I/O to submit in its time slice, the I/O scheduler waits before scheduling the next thread. For media where locality is less important (SSDs, SANs with lots of disks), setting Default is |
| Same as Default is |
| This option limits the maximum number of requests that are being processed by the device. For a storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance. However, it can also cause latency of certain I/O operations to increase, because more requests are buffered inside the storage. When changing this value, you can also consider tuning Default is |
| When enabled (which is the default on SUSE Linux Enterprise Server), the scheduler may dynamically adjust the length of the time slice by aiming to meet a tuning parameter called the Default is |
| Contains an estimated latency time in milliseconds for Default is |
| Same as Default is |
| To avoid starving of blkio cgroups doing dependent I/O, Default is |
| Same as Default is |
| This parameter is used to calculate the time slice for synchronous queue. It is specified in milliseconds. Increasing this value increases the time slice of synchronous queue. Default is |
| Same as Default is |
| This parameter is used to calculate the time slice for asynchronous queue. It is specified in milliseconds. Increasing this value increases the time slice of asynchronous queue. Default is |
| Same as Default is |
| This limits the maximum number of asynchronous requests (usually write requests) that are submitted in one time slice. Default is |
| Maximum "distance" (in Kbytes) for backward seeking. Default is |
| Used to compute the cost of backward seeking. Default is |
| Value (in milliseconds) is used to set the timeout of asynchronous requests. Default is |
| Value (in milliseconds) that specifies the timeout of synchronous requests. Default is |
CFQ #
In SUSE Linux Enterprise Server 12 SP4, the low_latency tuning parameter is enabled by default to ensure that processes get fair access within a bounded length of time. (Note that this parameter was not enabled in versions prior to SUSE Linux Enterprise 12.)
This is usually preferred in a server scenario where processes are executing I/O as part of transactions, as it makes the time needed for each transaction predictable. However, there are scenarios where that is not the desired behavior:
If the performance metric of interest is the peak performance of a single process when there is I/O contention.
If a workload must complete as quickly as possible and there are multiple sources of I/O. In this case, unfair treatment from the I/O scheduler may allow the transactions to complete faster: Processes take their full slice and exit quickly, resulting in reduced overall contention.
To address this, there are two options: increase target_latency or disable low_latency. As with all tuning parameters, it is important to verify that your workload behaves as expected before and after the tuning modification. Take careful note of whether your workload depends on individual process peak performance or scales better with fairness. Note also that performance depends on the underlying storage, and the correct tuning option for one installation may not be correct for another.
Find below an example that does not control when I/O starts but is simple enough to demonstrate the point. 32 processes are writing a small amount of data to disk in parallel. Using the SUSE Linux Enterprise Server default (enabling low_latency), the result looks as follows:
root # echo 1 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh
10485760 bytes (10 MB) copied, 2.62464 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 3.29624 s, 3.2 MB/s
10485760 bytes (10 MB) copied, 3.56341 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.56908 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53043 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.57511 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53672 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.5433 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.65474 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.63694 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.90122 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88507 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.86135 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.84553 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88871 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.94943 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.12731 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.15106 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.21601 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.35004 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.33387 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.55434 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52283 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52682 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.56176 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.62727 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.78958 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.79772 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.78004 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.77994 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.86114 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.88062 s, 2.1 MB/s
real 0m4.978s
user 0m0.112s
sys 0m1.544s
Note that each process completes in similar times. This is the CFQ scheduler meeting its target_latency: Each process has fair access to storage.
Note that the earlier processes complete somewhat faster. This happens because the start time of the processes is not identical. In a more complicated example, it is possible to control for this.
This is what happens when low_latency is disabled:
root # echo 0 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh
10485760 bytes (10 MB) copied, 0.813519 s, 12.9 MB/s
10485760 bytes (10 MB) copied, 0.788106 s, 13.3 MB/s
10485760 bytes (10 MB) copied, 0.800404 s, 13.1 MB/s
10485760 bytes (10 MB) copied, 0.816398 s, 12.8 MB/s
10485760 bytes (10 MB) copied, 0.959087 s, 10.9 MB/s
10485760 bytes (10 MB) copied, 1.09563 s, 9.6 MB/s
10485760 bytes (10 MB) copied, 1.18716 s, 8.8 MB/s
10485760 bytes (10 MB) copied, 1.27661 s, 8.2 MB/s
10485760 bytes (10 MB) copied, 1.46312 s, 7.2 MB/s
10485760 bytes (10 MB) copied, 1.55489 s, 6.7 MB/s
10485760 bytes (10 MB) copied, 1.64277 s, 6.4 MB/s
10485760 bytes (10 MB) copied, 1.78196 s, 5.9 MB/s
10485760 bytes (10 MB) copied, 1.87496 s, 5.6 MB/s
10485760 bytes (10 MB) copied, 1.9461 s, 5.4 MB/s
10485760 bytes (10 MB) copied, 2.08351 s, 5.0 MB/s
10485760 bytes (10 MB) copied, 2.28003 s, 4.6 MB/s
10485760 bytes (10 MB) copied, 2.42979 s, 4.3 MB/s
10485760 bytes (10 MB) copied, 2.54564 s, 4.1 MB/s
10485760 bytes (10 MB) copied, 2.6411 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 2.75171 s, 3.8 MB/s
10485760 bytes (10 MB) copied, 2.86162 s, 3.7 MB/s
10485760 bytes (10 MB) copied, 2.98453 s, 3.5 MB/s
10485760 bytes (10 MB) copied, 3.13723 s, 3.3 MB/s
10485760 bytes (10 MB) copied, 3.36399 s, 3.1 MB/s
10485760 bytes (10 MB) copied, 3.60018 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.58151 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.67385 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.69471 s, 2.8 MB/s
10485760 bytes (10 MB) copied, 3.66658 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.81495 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.10172 s, 2.6 MB/s
10485760 bytes (10 MB) copied, 4.0966 s, 2.6 MB/s
real 0m3.505s
user 0m0.160s
sys 0m1.516s
Note that the time processes take to complete is spread much wider as processes are not getting fair access. Some processes complete faster and exit, allowing the total workload to complete faster, and some processes measure higher apparent I/O performance. It is also important to note that this example may not behave similarly on all systems as the results depend on the resources of the machine and the underlying storage.
It is important to emphasize that neither tuning option is inherently better than the other. Both are best in different circumstances and it is important to understand the requirements of your workload and tune accordingly.
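The dd-test.sh script used in the measurements above is not shown in this chapter. To reproduce a comparable workload, a script of roughly the following shape could be used. This is a hypothetical sketch, not the original script: the function name run_parallel_writes, the file names, and the use of conv=fsync are assumptions.

```shell
# run_parallel_writes DIR [PROCESSES] [MEGABYTES]
# Start PROCESSES dd writers in parallel (default 32), each writing
# MEGABYTES MB (default 10) to its own file in DIR, then wait for all.
run_parallel_writes() {
    target_dir=$1
    processes=${2:-32}
    megabytes=${3:-10}
    i=1
    while [ "$i" -le "$processes" ]; do
        # conv=fsync makes dd flush the data to storage before it
        # reports its timing; tail keeps only the summary line.
        dd if=/dev/zero of="$target_dir/ddfile.$i" bs=1M \
            count="$megabytes" conv=fsync 2>&1 | tail -n 1 &
        i=$((i + 1))
    done
    wait
}

# Example:
# time run_parallel_writes /mnt/test
```

As in the original example, the writers do not start at exactly the same time, so earlier processes may finish somewhat sooner.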
12.2.2 NOOP #
A trivial scheduler that only passes down the I/O that comes to it. Useful for checking whether complex I/O scheduling decisions of other schedulers are causing I/O performance regressions.
This scheduler is recommended for setups with devices that do I/O scheduling themselves, such as intelligent storage or in multipathing environments. If you choose a more complicated scheduler on the host, the scheduler of the host and the scheduler of the storage device compete with each other. This can decrease performance. The storage device can usually determine best how to schedule I/O.
For similar reasons, this scheduler is also recommended for use within virtual machines.
The NOOP scheduler can be useful for devices that do not depend on mechanical movement, like SSDs. Usually, the DEADLINE I/O scheduler is a better choice for these devices. However, NOOP creates less overhead and can therefore increase performance for certain workloads.
There are no tunable parameters for NOOP.
12.2.3 DEADLINE #
DEADLINE is a latency-oriented I/O scheduler. Each I/O request is assigned a deadline. Usually, requests are stored in queues (read and write) sorted by sector numbers. The DEADLINE algorithm maintains two additional queues (read and write) in which requests are sorted by deadline. As long as no request has timed out, the “sector” queue is used. When timeouts occur, requests from the “deadline” queue are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.
This scheduler can provide superior throughput over the CFQ I/O scheduler in cases where several threads read and write and fairness is not an issue, for example, with several parallel readers from a SAN or with databases (especially when using “TCQ” disks). The DEADLINE scheduler has the following tunable parameters:
DEADLINE tunable parameters #
File | Description |
---|---|
| Controls how many times reads are preferred over writes. A value of Default is |
| Sets the deadline (current time plus the Default is |
| Sets the deadline (current time plus the Default is |
| Enables (1) or disables (0) attempts to front merge requests. Default is |
| Sets the maximum number of requests per batch (deadline expiration is only checked for batches). This parameter allows to balance between latency and throughput. When set to Default is |
12.3 Available I/O Elevators with blk-mq I/O path #
Below is a list of elevators available on SUSE Linux Enterprise Server for devices that use the blk-mq I/O path. If an elevator has tunable parameters, they can be set with the command:
echo VALUE > /sys/block/DEVICE/queue/iosched/TUNABLE
In the command above, VALUE is the desired value for the TUNABLE and DEVICE is the block device.
To find out what elevators are available for a device (sda for example), run the following command (the currently selected scheduler is listed in brackets):
tux > cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
When switching from the legacy block to the blk-mq I/O path for a device, the none option is roughly comparable to noop, mq-deadline is comparable to deadline, and bfq is comparable to cfq.
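When migrating existing tuning scripts from the legacy block path to blk-mq, the correspondence described above can be expressed as a small lookup. This is a minimal sketch; the function name blkmq_equivalent is hypothetical, and the mapping is only the rough equivalence stated in the text, not a guarantee of identical behavior.

```shell
# blkmq_equivalent LEGACY_ELEVATOR
# Print the roughly comparable blk-mq elevator for a legacy elevator.
blkmq_equivalent() {
    case "$1" in
        noop)     echo none ;;
        deadline) echo mq-deadline ;;
        cfq)      echo bfq ;;
        *)
            echo "no blk-mq counterpart listed for '$1'" >&2
            return 1
            ;;
    esac
}

# Example:
# blkmq_equivalent cfq
```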
12.3.1 MQ-DEADLINE #
MQ-DEADLINE is a latency-oriented I/O scheduler. It is a modification of the DEADLINE scheduler for the blk-mq I/O path (refer to Section 12.2.3, “DEADLINE”). MQ-DEADLINE has the same set of tunable parameters. Refer to Table 12.2, “DEADLINE tunable parameters” for a description.
12.3.2 NONE #
When NONE is selected as the I/O elevator option for blk-mq, no I/O scheduler is used, and I/O requests are passed down to the device without further I/O scheduling interaction. In this respect, it is comparable to the NOOP scheduler for the legacy block I/O path (see Section 12.2.2, “NOOP”).
NONE is the default for NVM Express devices. With no overhead compared to other I/O elevator options, it is considered the fastest way of passing down I/O requests on multiple queues to such devices.
There are no tunable parameters for NONE.
12.3.3 BFQ (Budget Fair Queueing) #
BFQ is a fairness-oriented scheduler. It is described as "a proportional-share storage-I/O scheduling algorithm based on the slice-by-slice service scheme of CFQ. But BFQ assigns budgets, measured in number of sectors, to processes instead of time slices." (Source: linux-4.12/block/bfq-iosched.c)
BFQ allows assigning I/O priorities to tasks, which are taken into account during scheduling decisions (see Section 8.3.3, “Prioritizing Disk Access with ionice”).
The BFQ scheduler has the following tunable parameters:
BFQ tunable parameters #
File | Description |
---|---|
| Value in milliseconds specifies how long to idle, waiting for next request on an empty queue. Default is |
| Same as Default is |
| Enables (1) or disables (0) Default is |
| Maximum value (in Kbytes) for backward seeking. Default is |
| Used to compute the cost of backward seeking. Default is |
| Value (in milliseconds) is used to set the timeout of asynchronous requests. Default is |
| Value in milliseconds specifies the timeout of synchronous requests. Default is |
| Maximum time in milliseconds that a task (queue) is serviced after it has been selected. Default is |
| Limit for number of sectors that are served at maximum within Default is |
| Enables (1) or disables (0) Default is |
12.3.4 KYBER #
KYBER is a latency-oriented I/O scheduler. It makes it possible to set target latencies for reads and synchronous writes and throttles I/O requests in order to try to meet these target latencies.
KYBER tunable parameters #
File | Description |
---|---|
| Sets the target latency for read operations in nanoseconds. Default is |
| Sets the target latency for write operations in nanoseconds. Default is |
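Because the KYBER target latencies are specified in nanoseconds, a small conversion helper can make values easier to read and less error-prone. The sketch below converts milliseconds to nanoseconds. The helper name ms_to_ns is hypothetical, and the tunable file names in the commented usage (read_lat_nsec and write_lat_nsec, as used by upstream kernels) are an assumption that should be verified against the iosched directory of your device.

```shell
# ms_to_ns MILLISECONDS
# Convert a whole number of milliseconds to nanoseconds.
ms_to_ns() {
    echo $(( $1 * 1000000 ))
}

# Example (requires root privileges and the kyber elevator to be active;
# the file names are assumed from upstream kernels):
# ms_to_ns 2  > /sys/block/sda/queue/iosched/read_lat_nsec
# ms_to_ns 10 > /sys/block/sda/queue/iosched/write_lat_nsec
```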
12.4 I/O Barrier Tuning #
Most file systems (such as XFS, Ext3, Ext4, or ReiserFS) send write barriers to disk after fsync or during transaction commits. Write barriers enforce proper ordering of writes, making volatile disk write caches safe to use (at some performance penalty). If your disks are battery-backed in one way or another, disabling barriers can safely improve performance.
Sending write barriers can be disabled using the nobarrier mount option.
Disabling barriers when disks cannot guarantee caches are properly written in case of power failure can lead to severe file system corruption and data loss.
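Given this risk, it can be worthwhile to audit which file systems are currently mounted without barriers. The following sketch scans a file in /proc/mounts format for the nobarrier option. The function name list_nobarrier_mounts is hypothetical, and also matching barrier=0 is an assumption, because some file systems may report the setting in that form.

```shell
# list_nobarrier_mounts FILE
# Print the mount point of every entry in FILE (in /proc/mounts format)
# whose options include nobarrier or barrier=0.
list_nobarrier_mounts() {
    # Field 2 is the mount point, field 4 the comma-separated options.
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "nobarrier" || opts[i] == "barrier=0") {
                print $2
                break
            }
    }' "$1"
}

# Example:
# list_nobarrier_mounts /proc/mounts
```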
12.5 Enable blk-mq I/O Path for SCSI by Default #
Block multiqueue (blk-mq) is a multi-queue block I/O queueing mechanism. Blk-mq uses per-cpu software queues to queue I/O requests. The software queues are mapped to one or more hardware submission queues. Blk-mq significantly reduces lock contention. In particular blk-mq improves performance for devices that support a high number of input/output operations per second (IOPS). Blk-mq is already the default for some devices, for example, NVM Express devices.
Blk-mq has a different set of I/O scheduler options. There is MQ-DEADLINE (comparable to DEADLINE) and NONE (comparable to NOOP). The CFQ I/O scheduler is no longer available with blk-mq, but there are two new I/O schedulers: BFQ and KYBER.
These changes in I/O scheduling can cause performance differences with blk-mq compared to the legacy block I/O path. Therefore, blk-mq is not enabled by default for SCSI devices.
If you have fast SCSI devices (for example, SSDs) instead of SCSI hard disks attached to your system, consider switching to blk-mq for SCSI. This is done using the kernel command line option scsi_mod.use_blk_mq=1.
If you have also attached SCSI hard disks (spinning devices) to your system, make sure to switch to the BFQ I/O scheduler for the spinning devices to avoid a significant performance degradation on them.
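To find the spinning devices that this advice applies to, the rotational attribute of each block device can be inspected. The sketch below only prints the commands it would suggest instead of running them, so the result can be reviewed first. The function name suggest_bfq_for_rotational is hypothetical, and the sketch assumes that the directory passed in has the layout of /sys/block.

```shell
# suggest_bfq_for_rotational SYSFS_BLOCK_DIR
# For every device below SYSFS_BLOCK_DIR (for example /sys/block) that
# reports itself as rotational, print a command that selects bfq.
suggest_bfq_for_rotational() {
    for dev_dir in "$1"/*; do
        rot_file=$dev_dir/queue/rotational
        # Skip entries that do not expose the rotational attribute.
        [ -r "$rot_file" ] || continue
        if [ "$(cat "$rot_file")" = "1" ]; then
            echo "echo bfq > $dev_dir/queue/scheduler"
        fi
    done
}

# Example:
# suggest_bfq_for_rotational /sys/block
```

The printed commands still need to be checked against the elevators that each device actually offers before they are run.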