Applies to SUSE Linux Enterprise Server 12 SP5

15 Tuning the Network #

Revision History: SUSE Linux Enterprise Serverマニュアル

The network subsystem is complex and its tuning highly depends on the system use scenario and on external factors such as software clients or hardware components (switches, routers, or gateways) in your network. The Linux kernel aims more at reliability and low latency than low overhead and high throughput. Other settings can mean less security, but better performance.

15.1 Configurable Kernel Socket Buffers #

Networking is largely based on the TCP/IP protocol and a socket interface for communication; for more information about TCP/IP, see 第16章「ネットワークの基礎」. The Linux kernel handles data it receives or sends via the socket interface in socket buffers. These kernel socket buffers are tunable.

Important: TCP Autotuning

Since kernel version 2.6.17 full autotuning with 4 MB maximum buffer size exists. This means that manual tuning usually will not improve networking performance considerably. It is often the best not to touch the following variables, or, at least, to check the outcome of tuning efforts carefully.

If you update from an older kernel, it is recommended to remove manual TCP tunings in favor of the autotuning feature.

The special files in the /proc file system can modify the size and behavior of kernel socket buffers; for general information about the /proc file system, see Section 2.6, “The /proc File System”. Find networking related files in:

/proc/sys/net/core
/proc/sys/net/ipv4
/proc/sys/net/ipv6

General net variables are explained in the kernel documentation (linux/Documentation/sysctl/net.txt). Special ipv4 variables are explained in linux/Documentation/networking/ip-sysctl.txt and linux/Documentation/networking/ipvs-sysctl.txt.

In the /proc file system, for example, it is possible to either set the Maximum Socket Receive Buffer and Maximum Socket Send Buffer for all protocols, or both these options for the TCP protocol only (in ipv4) and thus overriding the setting for all protocols (in core).

/proc/sys/net/ipv4/tcp_moderate_rcvbuf: If /proc/sys/net/ipv4/tcp_moderate_rcvbuf is set to 1, autotuning is active and buffer size is adjusted dynamically.
/proc/sys/net/ipv4/tcp_rmem: The three values setting the minimum, initial, and maximum size of the Memory Receive Buffer per connection. They define the actual memory usage, not only TCP window size.
/proc/sys/net/ipv4/tcp_wmem: The same as tcp_rmem, but for Memory Send Buffer per connection.
/proc/sys/net/core/rmem_max: Set to limit the maximum receive buffer size that applications can request.
/proc/sys/net/core/wmem_max: Set to limit the maximum send buffer size that applications can request.

Via /proc it is possible to disable TCP features that you do not need (all TCP features are switched on by default). For example, check the following files:

/proc/sys/net/ipv4/tcp_timestamps: TCP time stamps are defined in RFC1323.
/proc/sys/net/ipv4/tcp_window_scaling: TCP window scaling is also defined in RFC1323.
/proc/sys/net/ipv4/tcp_sack: Select acknowledgments (SACKS).

Use sysctl to read or write variables of the /proc file system. sysctl is preferable to cat (for reading) and echo (for writing), because it also reads settings from /etc/sysctl.conf and, thus, those settings survive reboots reliably. With sysctl you can read all variables and their values easily; as root use the following command to list TCP related settings:

sysctl -a | grep tcp

Note: Side-Effects of Tuning Network Variables

Tuning network variables can affect other system resources such as CPU or memory use.

15.2 Detecting Network Bottlenecks and Analyzing Network Traffic #

Before starting with network tuning, it is important to isolate network bottlenecks and network traffic patterns. There are some tools that can help you with detecting those bottlenecks.

The following tools can help analyzing your network traffic: netstat, tcpdump, and wireshark. Wireshark is a network traffic analyzer.

15.3 Netfilter #

The Linux firewall and masquerading features are provided by the Netfilter kernel modules. This is a highly configurable rule based framework. If a rule matches a packet, Netfilter accepts or denies it or takes special action (“target”) as defined by rules such as address translation.

There are quite a lot of properties Netfilter can take into account. Thus, the more rules are defined, the longer packet processing may last. Also advanced connection tracking could be rather expensive and, thus, slowing down overall networking.

When the kernel queue becomes full, all new packets are dropped, causing existing connections to fail. The 'fail-open' feature allows a user to temporarily disable the packet inspection and maintain the connectivity under heavy network traffic. For reference, see https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/.

For more information, see the home page of the Netfilter and iptables project, http://www.netfilter.org

15.4 Improving the Network Performance with Receive Packet Steering (RPS) #

Modern network interface devices can move so many packets that the host can become the limiting factor for achieving maximum performance. To keep up, the system must be able to distribute the work across multiple CPU cores.

Some modern network interfaces can help distribute the work to multiple CPU cores through the implementation of multiple transmission and multiple receive queues in hardware. However, others are only equipped with a single queue and the driver must deal with all incoming packets in a single, serialized stream. To work around this issue, the operating system must "parallelize" the stream to distribute the work across multiple CPUs. On SUSE Linux Enterprise Server this is done via Receive Packet Steering (RPS). RPS can also be used in virtual environments.

RPS creates a unique hash for each data stream using IP addresses and port numbers. The use of this hash ensures that packets for the same data stream are sent to the same CPU, which helps to increase performance.

RPS is configured per network device receive queue and interface. The configuration file names match the following scheme:

/sys/class/net/<device>/queues/<rx-queue>/rps_cpus

<device> stands for the network device, such as eth0, eth1. <rx-queue> stands for the receive queue, such as rx-0, rx-1.

If the network interface hardware only supports a single receive queue, only rx-0 will exist. If it supports multiple receive queues, there will be an rx-N directory for each receive queue.

These configuration files contain a comma-delimited list of CPU bitmaps. By default, all bits are set to 0. With this setting RPS is disabled and therefore the CPU that handles the interrupt will also process the packet queue.

To enable RPS and enable specific CPUs to process packets for the receive queue of the interface, set the value of their positions in the bitmap to 1. For example, to enable CPUs 0-3 to process packets for the first receive queue for eth0, set the bit positions 0-3 to 1 in binary: 00001111. This representation then needs to be converted to hex—which results in F in this case. Set this hex value with the following command:

echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus

If you wanted to enable CPUs 8-15:

1111 1111 0000 0000 (binary)
15     15    0    0 (decimal)
F       F    0    0 (hex)

The command to set the hex value of ff00 would be:

echo "ff00" > /sys/class/net/eth0/queues/rx-0/rps_cpus

On NUMA machines, best performance can be achieved by configuring RPS to use the CPUs on the same NUMA node as the interrupt for the interface's receive queue.

On non-NUMA machines, all CPUs can be used. If the interrupt rate is very high, excluding the CPU handling the network interface can boost performance. The CPU being used for the network interface can be determined from /proc/interrupts. For example:

root # cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
...
  51:  113915241          0          0          0      Phys-fasteoi   eth0
...

In this case, CPU 0 is the only CPU processing interrupts for eth0, since only CPU0 contains a non-zero value.

On x86 and AMD64/Intel 64 platforms, irqbalance can be used to distribute hardware interrupts across CPUs. See man 1 irqbalance for more details.