12 Power management #
Power management aims at reducing operating costs for energy and cooling systems while at the same time keeping the performance of a system at a level that matches the current requirements. Thus, power management is always a matter of balancing the actual performance needs and power saving options for a system. Power management can be implemented and used at different levels of the system. A set of specifications for power management functions of devices and the operating system interface to them has been defined in the Advanced Configuration and Power Interface (ACPI). As power savings in server environments can primarily be achieved at the processor level, this chapter introduces the main concepts and highlights a few tools for analyzing and influencing relevant parameters.
12.1 Power management at CPU Level #
At the CPU level, you can control power usage in several ways: by using idle power states (C-states), by changing the CPU frequency (P-states), and by throttling the CPU (T-states). The following sections give a short introduction to each approach and its significance for power savings. Detailed specifications can be found at https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf.
12.1.1 C-states (processor operating states) #
    Modern processors have several power saving modes called
    C-states. They reflect the capability of an idle
    processor to turn off unused components to save power.
   
    When a processor is in the C0 state, it is executing
    instructions. A processor running in any other C-state is idle. The
    higher the C number, the deeper the CPU sleep mode: more components are
    shut down to save power. Deeper sleep states can save large amounts of
    energy. Their downside is that they introduce latency: it
    takes more time for the CPU to return to C0.
    Depending on workload (threads waking up, triggering CPU usage and then
    going back to sleep again for a short period of time) and hardware (for
    example, interrupt activity of a network device), disabling the deepest
    sleep states can increase overall performance. For details
    on how to do so, refer to
    Section 12.3.2, “Viewing kernel idle statistics with cpupower”.
   
    Some states also have submodes with different power saving latency
    levels. Which C-states and submodes are supported depends on the
    respective processor. However, C1 is always
    available.
   
Table 12.1, “C-states” gives an overview of the most common C-states.
| Mode | Definition | 
|---|---|
| C0 | Operational state. CPU fully turned on. | 
| C1 | First idle state. Stops CPU main internal clocks via software. Bus interface unit and APIC are kept running at full speed. | 
| C2 | Stops CPU main internal clocks via hardware. State in which the processor maintains all software-visible states, but may take longer to wake up through interrupts. | 
| C3 | Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts. | 
    To avoid needless power consumption, it is recommended to test your
    workloads with deep sleep states enabled versus deep sleep states
    disabled. For more information, refer to
    Section 12.3.2, “Viewing kernel idle statistics with cpupower” or the
    cpupower-idle-set(1) man page.
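    Before running such a comparison, it helps to know which idle states the
    kernel exposes and how much wake-up latency each one adds. The following is
    a minimal sketch, assuming the cpuidle framework exposes its per-state
    files (name, latency, disable) under /sys/devices/system/cpu/cpu0/cpuidle.
    The function name list_idle_states and its optional directory argument are
    hypothetical and only serve this illustration.

```shell
# list_idle_states [DIR]
# Print name, exit latency (microseconds) and disable flag of each idle state.
# DIR defaults to the cpuidle directory of CPU 0. The optional argument is an
# assumption of this sketch so that it can be tried on sample data.
list_idle_states() (
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        # Skip entries without a readable name (for example, when the glob
        # did not match because no cpuidle driver is loaded).
        [ -r "$state/name" ] || continue
        name=$(cat "$state/name")
        latency=$(cat "$state/latency" 2>/dev/null || echo "?")
        disabled=$(cat "$state/disable" 2>/dev/null || echo "?")
        printf '%s latency_us=%s disabled=%s\n' "$name" "$latency" "$disabled"
    done
)
```

    Called without arguments, list_idle_states inspects CPU 0. A value of 1 in
    the disabled column means that the state is currently disabled.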
   
12.1.2 P-states (processor performance states) #
    While a processor operates (in C0 state), it can be in one of several
    CPU performance states (P-states). Whereas C-states
    are idle states (all but C0), P-states are
    operational states that relate to CPU frequency and voltage.
   
    The higher the P-state, the lower the frequency and voltage at which the
    processor runs. The number of P-states is processor-specific and the
    implementation differs across the different types. However,
    P0 is always the highest-performance state (except for Section 12.1.3, “Turbo features”). Higher
    P-state numbers represent slower processor speeds and lower power
    consumption. For example, a processor in P3 state runs
    more slowly and uses less power than a processor running in the
    P1 state. To operate at any P-state, the processor
    must be in the C0 state, which means that it is
    working and not idling. The CPU P-states are also defined in the ACPI
    specification, see https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf.
   
C-states and P-states can vary independently of one another.
12.1.3 Turbo features #
    Turbo features allow active CPU cores to be dynamically overclocked
    while other cores are in deep sleep states. This increases the performance
    of active threads while still
    complying with Thermal Design Power (TDP) limits.
   
    However, the conditions under which a CPU core can use turbo frequencies
    are architecture-specific. Learn how to evaluate the efficiency of those
    new features in Section 12.3, “The cpupower tools”.
   
12.2 In-kernel governors #
The in-kernel governors belong to the Linux kernel CPUfreq infrastructure and can be used to dynamically scale processor frequencies at runtime. You can think of the governors as a sort of preconfigured power scheme for the CPU. The CPUfreq governors use P-states to change frequencies and lower power consumption. The dynamic governors can switch between CPU frequencies, based on CPU usage, to allow for power savings while not sacrificing performance.
The following governors are available with the CPUfreq subsystem:
- Performance governor
- The CPU frequency is statically set to the highest possible for maximum performance. Consequently, saving power is not the focus of this governor. 
- Powersave governor
- The CPU frequency is statically set to the lowest possible. This can have severe impact on the performance, as the system never rises above this frequency no matter how busy the processors are. An important exception is the intel_pstate driver, which defaults to the powersave mode. This is due to a hardware-specific decision, but functionally it operates similarly to the on-demand governor.
  However, using this governor often does not lead to the expected power savings, as the highest savings can be achieved at idle through entering C-states. With the powersave governor, processes run at the lowest frequency and thus take longer to finish. This means it takes longer until the system can go into an idle C-state.
  Tuning options: The range of minimum frequencies available to the governor can be adjusted (for example, with the cpupower command line tool).
- On-demand governor
- The kernel implementation of a dynamic CPU frequency policy: The governor monitors the processor usage. When it exceeds a certain threshold, the governor sets the frequency to the highest available. If the usage is less than the threshold, the next lowest frequency is used. If the system continues to be underutilized, the frequency is again reduced until the lowest available frequency is set.
Not all drivers use the in-kernel governors to dynamically scale the CPU frequency at
    runtime. For example, the intel_pstate driver adjusts the frequency itself. Use
    the cpupower frequency-info command to find out which driver your system
    uses.
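If cpupower is not installed, the same information can usually be read from sysfs. The following is a minimal sketch, assuming the CPUfreq policy files scaling_driver, scaling_governor and scaling_available_governors exist under /sys/devices/system/cpu/cpu0/cpufreq. The function name show_cpufreq_policy and its optional directory argument are hypothetical.

```shell
# show_cpufreq_policy [DIR]
# Print the scaling driver, the active governor and the available governors.
# DIR defaults to the CPUfreq policy directory of CPU 0 (an assumption of this
# sketch; other CPUs may use a different policy).
show_cpufreq_policy() (
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    for attr in scaling_driver scaling_governor scaling_available_governors; do
        if [ -r "$dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dir/$attr")"
        else
            # Some drivers do not expose every attribute.
            printf '%s: (not available)\n' "$attr"
        fi
    done
)
```

Run show_cpufreq_policy without arguments to inspect CPU 0.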
12.3 The cpupower tools #
The cpupower tools are designed to give an overview
   of all CPU power-related parameters that are supported
   on a given machine, including turbo (or boost) states. Use the toolset to
   view and modify settings of the kernel-related CPUfreq and cpuidle systems
   and other settings not related to frequency scaling or idle states. The
   integrated monitoring framework can access both kernel-related parameters
   and hardware statistics. Therefore, it is ideally suited for performance
   benchmarks. It also helps you to identify the dependencies between turbo and
   idle states.
   
 After installing the cpupower package, view the
   available cpupower subcommands with
   cpupower --help. Access the general man page with
   man cpupower, and the man pages of the subcommands with
   man cpupower-SUBCOMMAND.
   
12.3.1 Viewing current settings with cpupower #
     The cpupower frequency-info command shows the
     statistics of the cpufreq driver used in the kernel. Additionally, it
     shows if turbo (boost) states are supported and enabled in the BIOS.
     Run without any options, it shows an output similar to the following:
    
Example 12.1: cpupower frequency-info
# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 3.80 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 3.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 3.40 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    3500 MHz max turbo 4 active cores
    3600 MHz max turbo 3 active cores
    3600 MHz max turbo 2 active cores
    3800 MHz max turbo 1 active cores
     To get the current values for all CPUs, use
     cpupower -c all frequency-info.
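     When checking many machines, a condensed view of this output can be
     handy. The following is a minimal sketch that reads
     cpupower frequency-info output on standard input and
     prints the driver, the hardware limits and the active governor. It assumes
     the output layout shown above, which may differ between cpupower versions.
     The function name summarize_frequency_info is hypothetical.

```shell
# summarize_frequency_info
# Read "cpupower frequency-info" output on stdin and print three key values.
# Assumes the line layout of the example output; other versions may differ.
summarize_frequency_info() {
    awk '
        # "  driver: intel_pstate"
        $1 == "driver:" { print "driver=" $2 }
        # "  hardware limits: 1.20 GHz - 3.80 GHz"
        $1 == "hardware" && $2 == "limits:" {
            printf "limits=%s%s-%s%s\n", $3, $4, $6, $7
        }
        # "The governor "powersave" may decide ..."
        /The governor "/ {
            split($0, part, "\"")
            print "governor=" part[2]
        }
    '
}
```

     For example, run
     cpupower frequency-info | summarize_frequency_info.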
    
12.3.2 Viewing kernel idle statistics with cpupower #
     The idle-info subcommand shows the statistics of the
     cpuidle driver used in the kernel. It works on all architectures that
     use the cpuidle kernel framework.
    
Example 12.2: cpupower idle-info
# cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
Analyzing CPU 0:
Number of idle states: 6
Available idle states: POLL C1-SNB C1E-SNB C3-SNB C6-SNB C7-SNB
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 163128
Duration: 17585669
C1-SNB:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 16170005
Duration: 697658910
C1E-SNB:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 4421617
Duration: 757797385
C3-SNB:
Flags/Description: MWAIT 0x10
Latency: 80
Usage: 2135929
Duration: 735042875
C6-SNB:
Flags/Description: MWAIT 0x20
Latency: 104
Usage: 53268
Duration: 229366052
C7-SNB:
Flags/Description: MWAIT 0x30
Latency: 109
Usage: 62593595
Duration: 324631233978
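     The raw Usage and Duration counters are easier to compare when they are
     turned into an average residency per entry. The following is a minimal
     sketch that reads cpupower idle-info output on standard
     input and divides Duration (microseconds) by Usage for each state. It
     assumes the output layout shown above, where each state starts with a line
     of the form NAME: and is followed by its Usage and Duration lines. The
     function name avg_idle_residency is hypothetical.

```shell
# avg_idle_residency
# Read "cpupower idle-info" output on stdin and print, per idle state, the
# average time in microseconds spent in the state per entry.
avg_idle_residency() {
    awk '
        # A state header is a single field ending in a colon, e.g. "C1-SNB:".
        NF == 1 && $1 ~ /:$/ {
            state = substr($1, 1, length($1) - 1)
            usage = 0
        }
        $1 == "Usage:" { usage = $2 }
        $1 == "Duration:" {
            if (usage > 0)
                printf "%s avg_us=%d\n", state, $2 / usage
            else
                printf "%s avg_us=n/a\n", state
        }
    '
}
```

     For the C7-SNB state in the output above, this works out to about 5186
     microseconds per entry, which suggests that this CPU spends long,
     uninterrupted periods in its deepest state.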
     After finding out which processor idle states are supported with
     cpupower idle-info, individual states can be
     disabled using the cpupower idle-set command.
     Typically one wants to disable the deepest sleep state, for example:
    
# cpupower idle-set -d 5
Or, to disable all idle states with latencies equal to or higher than 80:
# cpupower idle-set -D 80
12.3.3 Monitoring kernel and hardware statistics with cpupower #
     Use the monitor subcommand to report processor topology, and monitor frequency
     and idle power state statistics over a certain period of time. The
     default interval is 1 second, but it can be changed
     with the -i option. Independent processor sleep states and
     frequency counters are implemented in the tool—some retrieved
     from kernel statistics, others reading out hardware registers. The
     available monitors depend on the underlying hardware and the system.
     List them with cpupower monitor -l.
     For a description of the individual monitors, refer to the
     cpupower-monitor man page.
    
     The monitor subcommand allows you to execute
     performance benchmarks. To compare kernel statistics with hardware
     statistics for specific workloads, concatenate the respective command, for example:
# cpupower monitor db_test.sh
Example 12.3: cpupower monitor output
# cpupower monitor
    |Mperf               || Idle_Stats
CPU | C0   | Cx   | Freq || POLL | C1   | C2   | C3
   0|  3.71| 96.29|  2833||  0.00|  0.00|  0.02| 96.32
   1| 100.0| -0.00|  2833||  0.00|  0.00|  0.00|  0.00
   2|  9.06| 90.94|  1983||  0.00|  7.69|  6.98| 76.45
   3|  7.43| 92.57|  2039||  0.00|  2.60| 12.62| 77.52
Mperf shows the average frequency of a CPU, including boost frequencies, over time. Additionally, it shows the percentage of time the CPU has been active (C0) or in any sleep state (Cx).
Idle_Stats shows the statistics of the cpuidle kernel subsystem. The kernel updates these values every time an idle state is entered or left. Therefore, there can be a few inaccuracies when cores are in an idle state for some time when the measurement starts or ends.
      Apart from the (general) monitors in the example above, other
      architecture-specific monitors are available. For detailed
      information, refer to the cpupower-monitor man
      page.
     
     By comparing the values of the individual monitors, you can find
     correlations and dependencies and evaluate how well the power saving
     mechanism works for a certain workload. In
     Example 12.3 you can
     see that CPU 0 is idle (the value of
     Cx is near 100%), but runs at a high frequency.
     This is because the CPUs 0 and 1
     have the same frequency values which means that there is a dependency
     between them.
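     Such cases can also be picked out automatically. The following is a
     minimal sketch that reads cpupower monitor output on
     standard input and lists every CPU whose Cx value is at or above a
     threshold, together with its reported frequency. It assumes the column
     layout of the example (CPU, C0, Cx, Freq as the first four columns). Newer
     cpupower versions may print additional topology columns, in which case the
     field numbers need adjusting. The function name idle_cpus_with_freq and
     the default threshold of 90 are hypothetical.

```shell
# idle_cpus_with_freq [THRESHOLD]
# Read "cpupower monitor" output on stdin and print "CPU Cx Freq" for every
# CPU whose Cx percentage is at least THRESHOLD (default: 90).
# Assumes the columns CPU | C0 | Cx | Freq come first, as in the example.
idle_cpus_with_freq() {
    awk -F'|' -v thr="${1:-90}" '
        # Only data rows start with a plain CPU number; header rows do not.
        $1 ~ /^ *[0-9]+ *$/ && ($3 + 0) >= (thr + 0) {
            printf "%d %.2f %d\n", $1, $3, $4
        }
    '
}
```

     A mostly idle CPU that is listed with a high frequency, such as CPU 0 in
     the example, is a candidate for the kind of dependency described above.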
    
12.3.4 Modifying current settings with cpupower #
     Use the
     cpupower frequency-set command as root to
     modify current settings. It allows you to set values for the minimum or
     maximum CPU frequency the governor may select or to select a different
     governor. With the -c option, you can also specify for
     which of the processors the settings should be modified. That makes it
     easy to use a consistent policy across all processors without adjusting
     the settings for each processor individually. For more details and the
     available options, see the man page
     cpupower-frequency-set or run
     cpupower frequency-set
     --help.
    
12.4 Special tuning options #
The following sections highlight important settings.
12.4.1 Tuning options for P-states #
The CPUfreq subsystem offers several tuning options for P-states: You can switch between the different governors, influence minimum or maximum CPU frequency to be used or change individual governor parameters.
    To switch to another governor at runtime, use
    cpupower frequency-set with the -g option. For
    example, running the following command (as root) will activate the
    performance governor:
   
# cpupower frequency-set -g performance
    To set values for the minimum or maximum CPU frequency the governor may
    select, use the -d or -u option,
    respectively.
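    Before changing the range, it can be useful to check the requested values
    against the hardware limits. The following is a minimal sketch, assuming
    the limits are available in kHz from cpuinfo_min_freq and cpuinfo_max_freq
    under /sys/devices/system/cpu/cpu0/cpufreq. It only prints the resulting
    cpupower frequency-set command instead of running it.
    The function name plan_frequency_range and its arguments are hypothetical.

```shell
# plan_frequency_range MIN_KHZ MAX_KHZ [DIR]
# Check a requested frequency range against the hardware limits and print the
# matching cpupower command. Nothing is changed on the system.
# DIR defaults to the CPUfreq directory of CPU 0 (an assumption of this sketch).
plan_frequency_range() (
    min=$1
    max=$2
    dir=${3:-/sys/devices/system/cpu/cpu0/cpufreq}
    # The body runs in a subshell, so "exit" only leaves this function.
    hw_min=$(cat "$dir/cpuinfo_min_freq") || exit 1
    hw_max=$(cat "$dir/cpuinfo_max_freq") || exit 1
    if [ "$min" -gt "$max" ] || [ "$min" -lt "$hw_min" ] || [ "$max" -gt "$hw_max" ]; then
        echo "range ${min}-${max} kHz does not fit ${hw_min}-${hw_max} kHz" >&2
        exit 1
    fi
    # cpupower interprets frequencies without a unit suffix as kHz.
    echo "cpupower frequency-set -d $min -u $max"
)
```

    For example, plan_frequency_range 1200000 3000000 prints the command to
    run as root if both values are within the limits of CPU 0.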
   
12.5 Troubleshooting #
- BIOS options enabled?
- To use C-states or P-states, check your BIOS options:
  - To use C-states, make sure to enable CPU C State or similar options to benefit from power savings at idle.
  - To use P-states and the CPUfreq governors, make sure to enable Processor Performance States options or similar.
  - Even if P-states and C-states are available, it is possible that the platform firmware is managing CPU frequencies, which may be sub-optimal. For example, if pcc-cpufreq is loaded, the OS is only giving hints to the firmware, which is free to ignore the hints. This can be addressed by selecting "OS Management" or similar for CPU frequency management in the BIOS. After reboot, an alternative driver will be used, but the performance impact should be carefully measured.
  In case of a CPU upgrade, make sure to upgrade your BIOS, too. The BIOS needs to know the new CPU and its frequency stepping to pass this information on to the operating system.
- Log file information?
- Check the systemd journal (see Chapter 21, “journalctl: Query the systemd journal”) for any output regarding the CPUfreq subsystem. Only severe errors are reported there.
  If you suspect problems with the CPUfreq subsystem on your machine, you can also enable additional debug output. To do so, either use cpufreq.debug=7 as boot parameter or execute the following command as root:
  # echo 7 > /sys/module/cpufreq/parameters/debug
  This will cause CPUfreq to log more information to dmesg on state transitions, which is useful for diagnosis. But as this additional output of kernel messages can be rather comprehensive, use it only if you are sure that a problem exists.
12.6 More information #
Platforms with a Baseboard Management Controller (BMC) may have additional power management configuration options accessible via the service processor. These configurations are vendor-specific and therefore not covered in this guide. For more information, refer to the manuals provided by your vendor.
12.7 Monitoring power consumption with powerTOP #
powerTOP helps to identify the causes of unnecessarily high power consumption. This is especially useful for laptops, where minimizing power consumption is more important. It supports both Intel and AMD processors. Install it in the usual way:
> sudo zypper in powertop
powerTOP combines several sources of information (analysis of programs, device drivers, kernel options, number and sources of interrupts waking up processors from sleep states) and provides several ways of viewing them. You can launch it in interactive mode, which runs in an ncurses session (see Figure 12.1, “powerTOP in interactive mode”):
> sudo powertop
powerTOP supports exporting reports to HTML and CSV. The following example generates a single report of a 240-second run:
> sudo powertop --iteration=1 --time=240 --html=POWERREPORT.HTML
It can be useful to run separate reports over time. The following example runs powerTOP 10 times for 20 seconds each time, and creates a separate HTML report for each run:
> sudo powertop --iteration=10 --time=20 --html=POWERREPORT.HTML
This creates 10 time-stamped reports:
powerreport-20200108-104512.html powerreport-20200108-104451.html powerreport-20200108-104431.html [...]
An HTML report looks like Figure 12.2, “HTML powerTOP report”:
   The Tuning tab of the HTML reports, and the Tunables tab in the
   interactive mode, both provide commands for testing the various power
   settings. The HTML report prints the commands, which you can copy
   to a root command line for testing, for example
   echo '0' > '/proc/sys/kernel/nmi_watchdog'.
   The ncurses mode provides a simple toggle between Good
   and Bad. Good runs a command
   to enable power saving, and Bad turns off power saving.
   Enable all powerTOP settings with one command:
  
> sudo powertop --auto-tune
   None of these changes survive a reboot. To make any changes
   permanent, use sysctl, udev,
   or systemd to run your selected commands at
   boot. powerTOP includes a systemd service file,
   /usr/lib/systemd/system/powertop.service. This
   starts powerTOP with the --auto-tune option:
  
ExecStart=/usr/sbin/powertop --auto-tune
   Test this carefully before launching the systemd service,
   to see if it gives the results that you want.
   For example, you may not want USB keyboards and mice to enter power save mode, because constantly
   waking them up is disruptive, and other devices may need to be left alone as well. For easier testing and configuration editing,
   extract the commands from an HTML report with awk:
  
> awk -F '</?td ?>' '/tune/ { print $4 }' POWERREPORT.HTML
In calibrate mode, powerTOP sets up several runs that use different idle settings for backlight, CPU, Wi-Fi, USB devices, and disks, and helps to identify optimal brightness settings on battery power:
> sudo powertop --calibrate
You may call a file that creates a workload for more accurate calibration:
> sudo powertop --calibrate --workload=FILENAME --html=POWERREPORT.HTML
For more information, see:
- The powerTOP project page at https://01.org/powertop 

