
Advanced Optimization and New Capabilities of GCC 11

SUSE Best Practices

Development Tools

Authors
Martin Jambor, Toolchain Developer (SUSE)
Jan Hubička, Toolchain Developer (SUSE)
Richard Biener, Toolchain Developer (SUSE)
Martin Liška, Toolchain Developer (SUSE)
Michael Matz, Toolchain Team Lead (SUSE)
Brent Hollingsworth, Engineering Manager (AMD)
SUSE Linux Enterprise Server 15 SP3
Development Tools Module
Date: 2022-03-15

The document at hand provides an overview of GCC 11 as the current Development Tools Module compiler in SUSE Linux Enterprise 15 SP3. It focuses on the important optimization levels and options Link Time Optimization (LTO) and Profile Guided Optimization (PGO). Their effects are demonstrated by compiling the SPEC CPU benchmark suite for AMD EPYC 7003 Series Processors and building Mozilla Firefox for a generic x86_64 machine.

Disclaimer: Documents published as part of the SUSE Best Practices series have been contributed voluntarily by SUSE employees and third parties. They are meant to serve as examples of how particular actions can be performed. They have been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. SUSE cannot verify that actions described in these documents do what is claimed or whether actions described have unintended consequences. SUSE LLC, its affiliates, the authors, and the translators may not be held liable for possible errors or the consequences thereof.

1 Overview

The first release of the GNU Compiler Collection (GCC) with the major version 11, GCC 11.1, took place in April 2021. In late May of the same year, the entire openSUSE Tumbleweed Linux distribution was rebuilt with it and shipped to users. GCC 11.2, with fixes to over 95 bugs, was released in July of the same year and has subsequently replaced the compiler in the SUSE Linux Enterprise (SLE) Development Tools Module. GCC 11 comes with many new features, such as implementing parts of the most recent versions of various language specifications (especially C2X, C++17, C++20) and their extensions (OpenMP, OpenACC), supporting new capabilities of a wide range of computer architectures, and numerous generic optimization improvements.

This document gives an overview of GCC 11. It focuses on selecting appropriate optimization options for your application and stresses the benefits of advanced modes of compilation. First, we describe the optimization levels the compiler offers and other important options developers often use. We explain when and how you can benefit from using Link Time Optimization (LTO) and Profile Guided Optimization (PGO) builds. We also detail their effects when building a set of well known CPU intensive benchmarks, and we look at how these perform on AMD Zen 3 based EPYC 7003 Series Processors. Finally, we take a closer look at the effects they have on a big software project: Mozilla Firefox.

2 System compiler versus Development Tools Module compiler

The major version of the system compiler in SUSE Linux Enterprise 15 remains GCC 7, regardless of the service pack level. This minimizes the danger of unintended changes over the entire life time of the product.

sles15: # gcc --version
gcc (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

That does not mean that, as a user of SUSE Linux Enterprise 15, you are forced to use a compiler with features frozen in 2016. You can install an add-on module called Development Tools Module. This module is included in the SUSE Linux Enterprise Server 15 subscription and contains a much newer compiler.

At the time of writing this document, the compiler included in the Development Tools Module is GCC 11.2. Nevertheless, it is important to stress that, unlike the system compiler, the major version of the most recent GCC in the module will change shortly after the upstream release of GCC 12.2 (most likely in summer 2022), GCC 13.2 (summer 2023) and so forth. Note that only the most recent compiler in the Development Tools Module is supported at any time, except for a six-month overlap period after an upgrade. Developers on a SUSE Linux Enterprise Server 15 system therefore always have access to two supported GCC versions: the almost unchanging system compiler and the most recent compiler from the Development Tools Module.

Programs and libraries built with the compiler from the Development Tools Module can run on computers running SUSE Linux Enterprise Server 15 which do not have the module installed. All necessary runtime libraries are available from the main repositories of the operating system itself, and new ones are added through the standard update mechanism. In the document at hand, we use the term GCC 11 as a synonym for any minor version of major version 11, and GCC 11.2 to refer specifically to that version. In practice, they should be interchangeable.

2.1 When to use compilers from the Development Tools Module

Often you will find that the system compiler perfectly satisfies your needs. After all, it is the compiler used to build the vast majority of packages and their updates in the system itself. On the other hand, there are situations where a newer compiler is necessary, or where you want to consider using a newer compiler to get some benefits of its ongoing development.

If the program or library you are building uses language features which are not supported by GCC 7, you cannot use the system compiler. However, the compiler from the Development Tools Module will usually be sufficiently new. The most obvious case is C++. GCC 11 has a mature implementation of C++17 features, whereas the one in GCC 7 is only experimental and incomplete. The GNU C++ Library which accompanies GCC 11 is also almost C++17 feature-complete. Only hardware interference sizes [1] are not implemented.

Important
Important: Code using C++17 features

Code using C++17 features should always be compiled with the compiler from the Development Tools Module. Linking two objects which both use C++17, such as an application and a shared library, where one was built with g++ 8 or earlier and the other with g++ 9 or later, is particularly dangerous because C++ STL objects instantiated by the experimental code may have an implementation and even an ABI different from what the mature implementation expects, and vice versa. Issues caused by such a mismatch are difficult to predict and may include silent data corruption.

Most C++20 features are implemented in GCC 11 as experimental features. Try them out with appropriate caution and avoid linking together code that uses them and is produced by different compilers. Modules are only partially implemented [2] and require that the source file is compiled with the -fmodules-ts option. Similarly, coroutines [3] are also implemented but require that the source file is compiled with the -fcoroutines switch. If you are interested in the implementation status of any particular C++ feature in the compiler, consult the following pages:

Advances in supporting new language specifications are not limited to C++. GCC 11 supports several new features from the ISO 202X C standard draft, and the Fortran compiler has also seen many improvements. And if you use OpenMP or OpenACC extensions for parallel programming, you will find that the compiler supports a lot of features of new versions of these standards. For more details, visit the links at the end of this section.

In addition to newly supported language constructs, GCC 11 offers improved diagnostics when it reports errors and warnings, so that they are easier to understand and act upon. This is particularly useful when dealing with issues in templated C++ code. Furthermore, there are several new warnings which help to avoid common programming mistakes.

Because GCC 11 is newer, it can generate code for many recent processors that GCC 7 does not support; the list of such processors is too long to reproduce here. Nevertheless, in Section 7, “Performance evaluation: SPEC CPU 2017” we specifically look at optimizing code for an AMD EPYC 7003 Series Processor which is based on AMD Zen 3 cores. The system compiler does not know this kind of core and therefore cannot optimize for it. GCC 11, on the other hand, is the second major release supporting AMD Zen 3 cores, and thus can often produce significantly faster code for it.

Finally, the general optimization pipeline of the compiler has also significantly improved over the years, which we will demonstrate in the last sections of this document. To find out more about improvements in versions of GCC 8, 9, 10 and 11, visit their respective changes pages:

2.2 Potential issues with the Development Tools Module Compiler

GCC 11 from the Development Tools Module can sometimes behave differently in a way that can cause issues which were not present with the system compiler. Such problems encountered by other users are listed in the following documents:

To get an understanding of the problems, read through these four short pages. The document at hand briefly mentions three such potential pitfalls.

The first one is that, for performance reasons, GCC 10 and later default to -⁠fno-common which means that a linker error will now be reported if the same variable is defined in two C compilation units. This can happen if two or more .c files include the same header file which intends to declare a variable but omits the extern keyword when doing so, inadvertently resulting in multiple definitions. If you encounter such an error, you simply need to add the extern keyword to the declaration in the header file and define the variable in only a single compilation unit. Alternatively, you can compile your project with an explicit -⁠fcommon if you are willing to accept that this behavior is inconsistent with C++ and may incur speed and code size penalties.
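
The following minimal sketch illustrates the fix; the file names shared.h, state.c and user.c are hypothetical:

/* shared.h: only declares the variable, allocating no storage */
extern int counter;

/* state.c: exactly one compilation unit provides the definition */
#include "shared.h"
int counter = 0;

/* user.c: all other compilation units just use the declaration */
#include "shared.h"
void bump (void)
{
  counter++;
}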

Users compiling C++ sources should also be aware that g++ 11 defaults to -std=gnu++17, the C++17 standard with GNU extensions, instead of -std=gnu++14. Moreover, some C++ Standard Library headers have been changed to no longer include other headers that they do not depend on. You may need to explicitly include <limits>, <memory>, <utility> or <thread>.

The final issue emphasized here is that the C++ compiler in GCC 8 and later now assumes that no execution path in a non-void function simply reaches the end of the function without a return statement. This means it is assumed that such code paths will never be executed, and thus they will be eliminated. You should therefore pay special attention to warnings produced by -Wreturn-type. This option is enabled by default and indicates which functions might be affected.
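
The following hypothetical function shows the kind of code that -Wreturn-type flags and that GCC 8 and later optimize on the assumption that the fall-through path is never executed:

/* Compiled as C++, reaching the closing brace is assumed to be impossible,
   so the x == 0 case may be removed entirely.  -Wreturn-type warns here. */
int sign (int x)
{
  if (x > 0)
    return 1;
  if (x < 0)
    return -1;
  /* missing: return 0; */
}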

2.3 Installing GCC 11 from the Development Tools Module

Similar to other modules and extensions for SUSE Linux Enterprise Server 15, you can activate the Development Tools Module either using the command line tool SUSEConnect or using the YaST setup and configuration tool. To use the former, carry out the following steps:

  1. As root, start by listing the available and activated modules and extensions:

    sles15: # SUSEConnect --list-extensions
  2. In the computer output, look for Development Tools Module:

                Development Tools Module 15 SP3 x86_64
                Activate with: SUSEConnect -p sle-module-development-tools/15.3/x86_64

    If you see the text (Activated) next to the module name, the module is already ready to be used. You can safely proceed to the installation of the compiler packages.

  3. Otherwise, issue the activation command that is shown in the command output above:

    sles15: # SUSEConnect -p sle-module-development-tools/15.3/x86_64
    Registering system to SUSE Customer Center
    
    Updating system details on https://scc.suse.com ...
    
    Activating sle-module-development-tools 15.3 x86_64 ...
    -> Adding service to system ...
    -> Installing release package ...
    
    Successfully registered system

If you prefer to use YaST, the procedure is also straightforward. Run YaST as root and go to the Add-On Products menu in the Software section. If Development Tools Module is among the listed installed modules, you already have the module activated and can proceed with installing individual compiler packages. If not, click the Add button, select Select Extensions and Modules from Registration Server, and YaST will guide you through a simple procedure to add the module.

When you have the Development Tools Module installed, you can verify that the GCC 11 packages are available to be installed on your system:

sles15: # zypper search gcc11
Refreshing service 'Basesystem_Module_15_SP3_x86_64'.
Refreshing service 'Containers_Module_15_SP3_x86_64'.
Refreshing service 'Desktop_Applications_Module_15_SP3_x86_64'.
Refreshing service 'Development_Tools_Module_15_SP3_x86_64'.
Refreshing service 'SUSE_Linux_Enterprise_Server_15_SP3_x86_64'.
Refreshing service 'Server_Applications_Module_15_SP3_x86_64'.
Refreshing service 'Web_and_Scripting_Module_15_SP3_x86_64'.
Loading repository data...
Reading installed packages...

S | Name                         | Summary
--+------------------------------+-------------------------------------------------------
  | gcc11                        | The GNU C Compiler and Support Files
  | gcc11                        | The GNU C Compiler and Support Files
  | gcc11-32bit                  | The GNU C Compiler 32bit support
  | gcc11-ada                    | GNU Ada Compiler Based on GCC (GNAT)
  | gcc11-ada-32bit              | GNU Ada Compiler Based on GCC (GNAT)
  | gcc11-c++                    | The GNU C++ Compiler
  | gcc11-c++-32bit              | The GNU C++ Compiler
  | gcc11-d                      | GNU D Compiler
  | gcc11-d-32bit                | GNU D Compiler
  | gcc11-fortran                | The GNU Fortran Compiler and Support Files
  | gcc11-fortran-32bit          | The GNU Fortran Compiler and Support Files
  | gcc11-go                     | GNU Go Compiler
  | gcc11-go-32bit               | GNU Go Compiler
  | gcc11-info                   | Documentation for the GNU compiler collection
  | gcc11-locale                 | Locale Data for the GNU Compiler Collection
  | gcc11-obj-c++                | GNU Objective C++ Compiler
  | gcc11-obj-c++-32bit          | GNU Objective C++ Compiler
  | gcc11-objc                   | GNU Objective C Compiler
  | gcc11-objc-32bit             | GNU Objective C Compiler
  | gcc11-testresults            | Testsuite results
  | gcc11-testresults            | Testsuite results
  | libstdc++6-devel-gcc11       | Include Files and Libraries mandatory for Development
  | libstdc++6-devel-gcc11-32bit | Include Files and Libraries mandatory for Development
  | libstdc++6-pp-gcc11          | GDB pretty printers for the C++ standard library
  | libstdc++6-pp-gcc11-32bit    | GDB pretty printers for the C++ standard library

Now you can simply install the compilers for the programming languages you use with zypper:

sles15: # zypper install gcc11 gcc11-c++ gcc11-fortran

After the installation, the compilers are available on your system; the executables are called gcc-11, g++-11, gfortran-11 and so forth. It is also possible to install the packages in YaST. To do so, simply enter the Software Management menu in the Software section and search for gcc11. Then select the packages you want to install. Finally, click the Accept button.

Note
Note: Newer compilers on openSUSE Leap 15.3

The community distribution openSUSE Leap 15.3 shares most of the base packages with SUSE Linux Enterprise Server 15 SP3. The system compiler on systems running openSUSE Leap 15.3 is also GCC 7.5. There is no Development Tools Module for the community distribution, but a newer compiler is provided nevertheless. Simply install the packages gcc11, gcc11-c++, gcc11-fortran, and the like.

3 Optimization levels and related options

GCC has a rich optimization pipeline that is controlled by approximately a hundred command line options. It would be impractical to force users to decide, for each one of them, whether they want it enabled when compiling their code. Like all other modern compilers, GCC therefore offers the concept of optimization levels which allow the user to pick a configuration from a few common ones. Optionally, the user can tweak the selected level, but that does not happen frequently.

The default is to not optimize. You can specify this optimization level on the command line as -O0. It is often used when developing and debugging a project, which means it is usually accompanied by the command line switch -g so that debug information is emitted. As no optimizations take place, no information is lost in the process: no variables are optimized away, the compiler only inlines functions with special attributes that require it, and so on. As a consequence, the debugger can almost always find everything it searches for in the running program and report its state very well. On the other hand, the resulting code is big and slow, so this optimization level should not be used for release builds.
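
For example, a debug build of a hypothetical source file myprog.c could be produced with:

sles15: # gcc-11 -O0 -g -o myprog myprog.c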

The most common optimization level for release builds is -O2, which attempts to optimize the code aggressively but avoids large compile times and excessive code growth. Optimization level -O3 instructs GCC to simply optimize as much as possible, even if the resulting code might be considerably bigger and the compilation can take longer. Note that neither -O2 nor -O3 imply anything about the precision and semantics of floating-point operations. Even at the optimization level -O3, GCC implements math functions so that they strictly follow the respective IEEE and/or ISO rules. This often means that the compiled programs run markedly slower than necessary if such strict adherence is not required. The command line switch -ffast-math is a common way to relax the rules governing floating-point operations. It is beyond the scope of this document to list the fine-grained options it enables and their meaning. However, if your software crunches floating-point numbers and its runtime is a priority, look them up in the GCC manual and review which semantics of floating-point operations you need.
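
As an illustration of what the strict rules forbid, consider the following hypothetical reduction loop. Because floating-point addition is not associative, GCC must perform the additions in source order and therefore cannot vectorize the loop; with -ffast-math (or the fine-grained -fassociative-math it implies) the additions may be reordered and the loop can be vectorized:

/* Sums n doubles; vectorizable only when reassociation is allowed. */
double sum (const double *a, long n)
{
  double s = 0.0;
  for (long i = 0; i < n; i++)
    s += a[i];
  return s;
}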

The most aggressive optimization level is -Ofast, which implies -ffast-math along with a few options that disregard strict standard compliance. In GCC 11 this level also means the optimizers may introduce data races when moving memory stores, which may not be safe for multithreaded applications. Additionally, the Fortran compiler can take advantage of the associativity of math operations, even across parentheses, and can convert big memory allocations on the heap to allocations on the stack. The last mentioned transformation may cause the code to exceed the maximum stack size allowed by ulimit, which is then reported to the user as a segmentation fault. We often use level -Ofast to build benchmarks: it is a shorthand for options on top of -O3 which often make them run faster, and benchmarks are usually written in a way that they still run correctly.

If you feed the compiler with huge machine-generated input, especially if individual functions happen to be extremely large, the compile time can become an issue even when using -⁠O2. In such cases, use the most lightweight optimization level -⁠O1 that avoids running almost all optimizations with quadratic complexity. Finally, the -⁠Os level directs the compiler to aggressively optimize for the size of the binary.

Note
Note: Optimization level recommendation

Usually we recommend using -⁠O2. This is the optimization level we use to build most SUSE and openSUSE packages, because at this level the compiler makes balanced size and speed trade-offs when building a general-purpose operating system. However, we suggest using -⁠O3 if you know that your project is compute-intensive and is either small or an important part of your actual workload. Moreover, if the compiled code contains performance-critical floating-point operations, we strongly advise that you investigate whether -⁠ffast-math or any of the fine-grained options it implies can be safely used.

If your project and the techniques you use to debug or instrument it do not depend on ELF symbol interposition, you may consider trying to speed it up by using -⁠fno-semantic-interposition. This allows the compiler to inline calls and propagate information even when it would be illegal if a symbol changed during dynamic linking. Using this option to signal to the compiler that interposition is not going to happen is known to significantly boost performance of some projects, most notably the Python interpreter.

Some projects use -⁠fno-strict-aliasing to work around type punning problems in the source code. This is not recommended except for very low-level hand-optimized code such as the Linux kernel. Type-based alias analysis is a very powerful tool. It is used to enable other transformations, such as store-to-load propagation that in turn enables other high level optimizations, such as aggressive inlining, vectorization and others.
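
The following hypothetical example shows the kind of type punning that violates the aliasing rules, together with a well-defined alternative that lets you keep strict aliasing enabled; GCC typically compiles the small fixed-size memcpy down to a single move, so there is no runtime cost:

#include <string.h>

/* Undefined behavior: an unsigned int object is read through a float pointer. */
float bits_to_float_bad (unsigned int u)
{
  return *(float *) &u;
}

/* Well defined: the bytes are copied and no object is accessed through an
   incompatible type; the call itself is optimized away. */
float bits_to_float_ok (unsigned int u)
{
  float f;
  memcpy (&f, &u, sizeof f);
  return f;
}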

With the -g switch GCC tries hard to generate useful debug information even when optimizing. However, a lot of information is irrecoverably lost in the process. Debuggers also often struggle to present the user with a view of the state of a program in which statements are not necessarily executed in the original order. Debugging optimized code can therefore be a challenging task but usually is still somewhat possible.

The complete list of optimization and other command line switches is available in the compiler manual, provided in the info format in the package gcc11-info or online at the GCC project Web site.

Bear in mind that although almost all optimizing compilers have the concept of optimization levels and their optimization levels often have the same names as those in GCC, they do not necessarily mean to make the same trade-offs. Famously, GCC's -⁠Os optimizes for size much more aggressively than LLVM/Clang's level with the same name. Therefore, it often produces slower code; the more equivalent option in Clang is -⁠Oz which GCC does not have. Similarly, -⁠O2 can have different meanings for different compilers. For example, the difference between -⁠O2 and -⁠O3 is much bigger in GCC than in LLVM/Clang.

Note
Note: Changing the optimization level with cmake

If you use cmake to configure and set up builds of your application, be aware that its release optimization level defaults to -⁠O3 which might not be what you want. To change it, you must modify the CMAKE_C_FLAGS_RELEASE, CMAKE_CXX_FLAGS_RELEASE and/or CMAKE_Fortran_FLAGS_RELEASE variables, since they are appended at the end of the compilation command lines, thus overwriting any level set in the variables CMAKE_C_FLAGS, CMAKE_CXX_FLAGS, and the like.

4 Taking advantage of newer processors

By default GCC assumes that you want to run the compiled program on a wide variety of CPUs, including fairly old ones, regardless of the selected optimization level. On architectures like x86_64 and aarch64 the generated code will only contain instructions available on every CPU model of the architecture, including the earliest ones. On x86_64 in particular this means that the programs will use the SSE and SSE2 instruction sets for floating-point and vector operations but not any more recent ones.

If you know that the generated binary will run only on machines supporting newer instruction set extensions, you can say so on the command line. The complete list of the relevant options is available in the manual, but the most prominent one is -march, which lets you select a CPU model to generate code for. For example, if you know that your program will only be executed on AMD EPYC 7003 Series Processors, which are based on AMD Zen 3 cores, or on processors compatible with them, you can instruct GCC to take advantage of all the instructions this CPU supports with the option -march=znver3. Note that on SUSE Linux Enterprise Server 15, the system compiler does not know this particular value of the switch; you need to use GCC 11 from the Development Tools Module to optimize code for these processors.

To run the program on the machine on which you are compiling it, you can have the compiler auto-detect the target CPU model for you with the option -⁠march=native. This only works if the compiler is new enough. The system compiler of SUSE Linux Enterprise Server, for example, misidentifies AMD EPYC 7003 Series Processors as being based on the AMD Zen 1 core. Among other things, this means that it only emits 128bit vector instructions, even though the CPU has data-paths wide enough to efficiently process 256bit ones. Again, the easy solution is to use the compiler from the Development Tools Module when targeting recent processors.
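
For example, a hypothetical source file compute.c can be compiled for AMD Zen 3 cores, or for the CPU of the build machine itself, as follows; the last command is one way to inspect which CPU model -march=native resolved to:

sles15: # gcc-11 -O3 -march=znver3 -c compute.c
sles15: # gcc-11 -O3 -march=native -c compute.c
sles15: # gcc-11 -march=native -Q --help=target | grep march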

5 Link Time Optimization (LTO)

Figure 1 outlines the classic mode of operation of a compiler and a linker. Pieces of a program are compiled and optimized in user-defined chunks called compilation units to produce so-called object files, which already contain binary machine instructions and which are combined together by a linker. Because the linker works at such a low level, it cannot perform much optimization, and the division of the program into compilation units thus presents a profound barrier to optimization.

Figure 1: Traditional program build

This limitation can be overcome by rearranging the process so that the linker does not receive as its input the almost finished object files containing machine instructions, but is invoked on files containing so-called intermediate language (IL), which is a much richer representation of each original compilation unit (see figure 2). The linker identifies the input as not yet entirely compiled and invokes a linker plugin which in turn runs the compiler again. But this time the compiler has at its disposal the representation of the entire program or library that is being built. It decides which optimizations across function and compilation unit boundaries will be carried out and then divides the program into a set of partitions. Each of the partitions is further optimized independently, and machine code is emitted for it, which is finally linked the traditional way. The partitions are processed in parallel.

Figure 2: Building a program with GCC using Link Time Optimization (LTO)

To use Link Time Optimization, all you need to do is add the -flto switch to the compilation command line. The vast majority of packages in the Linux distribution openSUSE Tumbleweed have been built with LTO for over two years without any major problems. A lot of work has also been put into emitting good debug information when building with LTO, so the debugging experience is no longer as severely limited as it was a couple of years ago.

LTO in GCC always consists of a whole program analysis (WPA) stage followed by the majority of the compilation process performed in parallel, which greatly reduces the build times of most projects. To control the parallelism, you can explicitly cap the number of parallel compilation processes at n by specifying -flto=n on the linker command line. Alternatively, it is possible to use the GNU make jobserver with -flto=jobserver while also prepending the makefile rule invoking the link step with the + character to instruct GNU make to keep the jobserver available to the linker process. You can also use -flto=auto, which instructs GCC to search for the jobserver and, if it is not found, use all available CPU threads.
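
Put together, a minimal LTO build of a hypothetical two-file project looks like this; -flto is passed both when compiling and when linking, and -flto=auto lets the link step pick a suitable degree of parallelism:

sles15: # gcc-11 -O2 -flto -c a.c
sles15: # gcc-11 -O2 -flto -c b.c
sles15: # gcc-11 -O2 -flto=auto -o app a.o b.o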

Note that there is a principal architectural difference in how GCC and LLVM/Clang approach LTO. Clang provides two LTO mechanisms, so-called thin LTO and full LTO. In full LTO, LLVM processes the whole program as if it were a single translation unit, which does not allow for any parallelism. GCC can be configured to operate in this way with the option -flto-partition=one. LLVM in thin LTO mode can compile different compilation units in parallel and makes inlining across compilation unit boundaries possible, but not most other types of cross-module optimizations. This mechanism therefore has an inherently higher code quality penalty than full LTO or the approach of GCC.

5.1 Most notable benefits of LTO

Applications built with LTO are often faster, mainly because the compiler can inline calls to functions in another compilation unit. This possibility also allows programmers to structure their code according to its logical division, because they are not forced to put function definitions into header files to enable their inlining. Since the compiler cannot inline all calls that convey information known at compilation time, GCC also tracks and propagates constants, value ranges and devirtualization contexts from the call sites to the callees, often even when they are passed in an aggregate or by reference, which can subsequently save unnecessary computations. LTO allows such propagation across compilation unit boundaries, too.

Link Time Optimization with whole program analysis also offers many opportunities to shrink the code size of the built project. Thanks to symbol promotion and inter-procedural unreachable code elimination, functions and their parts which are not necessary in any particular project can be removed even when they are not declared static and are not defined in an anonymous namespace. Automatic attribute discovery can identify C++ functions that do not throw exceptions which allows the compiler to avoid generating a lot of code in exception cleanup regions. Identical code folding can find functions with the same semantics and remove all but one of them. The code size savings are often very significant and a compelling reason to use LTO even for applications which are not CPU-bound.

Note
Note: Building libraries with LTO

Symbol promotion is controlled by the resolution information given to the linker and depends on the type of DSO being built. When producing a dynamically loaded shared library, all symbols with default visibility can be overridden by the dynamic linker. This blocks the promotion of all functions not declared inline, so it is necessary to use hidden visibility wherever possible to achieve the best results. Similar problems occur even when building static libraries with -rdynamic.
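
A minimal sketch of this approach, with a hypothetical library source file mylib.c, could look as follows:

/* Built with something like:
   gcc-11 -O2 -flto -fPIC -fvisibility=hidden -shared -o libmylib.so mylib.c */

/* Hidden because of -fvisibility=hidden: WPA may freely promote, inline or
   remove this helper. */
int helper (int x)
{
  return x + 1;
}

/* The public interface keeps default visibility and remains overridable by
   the dynamic linker. */
__attribute__ ((visibility ("default")))
int mylib_process (int x)
{
  return helper (x) * 2;
}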

5.2 Potential issues with LTO

As noted earlier, the vast majority of packages in the openSUSE Tumbleweed distribution are built with LTO without any need to tweak them, and they work fine. Nevertheless, some low-level constructs pose a problem for LTO. One typical issue is symbols defined in inline assembly, which can happen to be placed in a different partition from their uses and subsequently fail the final linking step. To build such projects with LTO, the assembler snippets defining the symbols must be placed into a separate assembler source file so that they only participate in the final linking step. Global register variables are not supported by LTO at all, and programs either must not use this feature or must be built the traditional way.

Another notable limitation of LTO is that it does not support symbol versioning implemented with special inline assembly snippets (as opposed to a linker map file). To define symbol versions in the source files, use the new symver function attribute instead. As an example, the following snippet makes the function foo_v1 implement foo in node VERS_1 (which must be specified in the version script supplied to the linker). Consult the manual for more details.

__attribute__ ((__symver__ ("foo@VERS_1")))
int foo_v1 (void)
{
  return 0;  /* the body of the version 1 implementation is irrelevant here */
}

Sometimes the extra power of LTO reveals pre-existing problems which do not manifest themselves otherwise. Violations of the (strict) aliasing rules and of the C++ one definition rule tend to cause misbehavior significantly more often; the latter is fortunately reported by the -Wodr warning, which is on by default and should not be ignored. We have also seen cases where the use of the flatten function attribute led to an unsustainable amount of inlining with LTO. Furthermore, LTO is not a good fit for code snippets compiled by configure scripts (generated by autoconf) to discover the availability of various features, especially when the script then searches for a string in the generated assembly.

Finally, we needed to configure the virtual machines building the biggest openSUSE packages to have more memory than when not using LTO. Whereas in the traditional mode of compilation 1 GB of RAM per core was enough to build Mozilla Firefox, the serial step of LTO means the build-bots need 16 GB even when they have fewer than 16 cores.

6 Profile-Guided Optimization (PGO)

Optimizing compilers frequently make decisions that depend on which path through the code they consider most likely to be executed, how many times a loop is expected to iterate, and similar estimates. They also often face trade-offs between potential runtime benefits and code size growth. Ideally, they would optimize only frequently executed (also called hot) bits of a program for speed and everything else for size to reduce strain on caches and make the distribution of the built software cheaper. Unfortunately, guessing which parts of a program are the hot ones is difficult, and even sophisticated estimation algorithms implemented in GCC are no match for a measurement.

If you do not mind adding an extra level of complexity to the build system of your project, you can make such measurement part of the process. The makefile (or any other) build script needs to compile the project twice. The first time it needs to compile with the -⁠fprofile-generate option and then execute the resulting binary in one or multiple train runs during which it will save information about the behavior of the program to special files. Afterward, the project needs to be rebuilt again, this time with the -⁠fprofile-use option which instructs the compiler to look for the files with the measurements and use them when making optimization decisions, a process called Profile-Guided Optimization (PGO).
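
Expressed as commands for a hypothetical single-file program, the whole cycle could look like this; the train run writes *.gcda profile files which the second compilation then picks up:

sles15: # gcc-11 -O2 -fprofile-generate -o app main.c
sles15: # ./app training-input.dat
sles15: # gcc-11 -O2 -fprofile-use -o app main.c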

It is important that the train run exhibits the same characteristics as the real workload. Unless you use the option -fprofile-partial-training in the second build, the train run needs to exercise the code that is also the most frequently executed in real use; otherwise that code will be optimized for size and PGO will do more harm than good. With the option, GCC reverts to guessing the properties of the portions of the project not exercised in the train run, as if they were compiled without profile feedback. This, however, also means that the binary will not shrink as much as one would expect from a PGO build.

On the other hand, train runs do not need to be a perfect simulation of the real workload. For example, even though in theory a test suite should not make a very good train run because it disproportionately often tests various corner cases, in practice many projects use one as a train run and achieve significant runtime improvements with real workloads, too.

Profiles collected using an instrumented binary for multithreaded programs may be inconsistent because of missed counter updates. You can use -⁠fprofile-correction in addition to -⁠fprofile-use so that GCC uses heuristics to correct or smooth out such inconsistencies instead of emitting an error.

Profile-Guided Optimization can be combined with and is complementary to Link Time Optimization. While LTO expands what the compiler can do, PGO informs it about which parts of the program are the important ones that should be focused on. The following sections detail this by means of two rather different case studies.

7 Performance evaluation: SPEC CPU 2017

Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation that publishes a variety of industry standard benchmarks to evaluate performance and other characteristics of computer systems. Its latest suite of CPU intensive workloads, SPEC CPU 2017, is often used to compare compilers and how well they optimize code with different settings because the included benchmarks are well known and represent a wide variety of computation-heavy programs. This section highlights selected results of a GCC 11 evaluation using the suite.

Note that when we use SPEC to perform compiler comparisons, we are lenient toward some official SPEC rules which system manufacturers need to observe to claim an official score for their system. We disregard the concepts of base and peak metrics and simply focus on results of compilations using a particular set of options. We even patched several benchmarks:

  • Benchmarks 502.gcc_r, 505.mcf_r, 511.povray_r, and 527.cam4_r contain an implementation of quicksort that violates the (strict) C/C++ aliasing rules, which can lead to erroneous behavior when optimizing at link time. SPEC decided not to change the released benchmarks and simply suggests that these benchmarks be built with the -fno-strict-aliasing option when they are built with GCC. That makes evaluating compilers with SPEC problematic, because gauging their ability to use aliasing rules to facilitate optimizations is important. We have therefore disabled strict aliasing only for the problematic qsort functions with the following function attribute:

    __attribute__((optimize("-fno-strict-aliasing")))

    As a result, the only benchmark which we compile with -⁠fno-strict-aliasing is 500.perlbench_r.

  • We have increased the tolerance of 549.fotonik3d_r to rounding errors after it became clear the intention was that the compiler can use relaxed semantics of floating-point operations in the benchmark (see GCC bug 84201).

For these and other reasons, all the results in this document are non-reportable. Finally, SPEC CPU 2017 offers so-called speed and rate metrics. For our purposes, we mostly ignore the differences and simply run the benchmarks configured for rate metrics (mainly because the runtimes are shorter), but we always run all benchmarks single-threaded.

SPEC specifies a base runtime for each benchmark and defines a rate as the ratio of the base runtime and the median measured runtime (this rate is a separate concept from the rate metrics). The overall suite score is then calculated as geometric mean of these ratios. The bigger the rate or score, the better it is. In the remainder of this section, we report runtimes using relative rates and their geometric means as they were measured on an AMD EPYC 7543P Processor running SUSE Linux Enterprise Server 15 SP3.
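
Written out in our notation, with base_runtime_i the base runtime of benchmark i, runtime_i its median measured runtime and n the number of benchmarks:

rate_i = base_runtime_i / runtime_i
score  = (rate_1 * rate_2 * ... * rate_n)^(1/n)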

7.1 Benefits of LTO and PGO

In Section 3, “Optimization levels and related options” we recommend that HPC workloads are compiled with -⁠O3 and benchmarks with -⁠Ofast. But it is still interesting to look at integer crunching benchmarks built with only -⁠O2 because that is how Linux distributions often build the programs from which they were extracted. We have already mentioned that almost the whole openSUSE Tumbleweed distribution is now built with LTO, and selected packages with PGO, and the following paragraphs demonstrate why.

Figure 3: Overall performance (bigger is better) of SPEC INTrate 2017 built with GCC 11.2 and -⁠O2

Figure 3 shows the overall performance effect on the whole integer benchmark suite as captured by the geometric mean of all individual benchmark rates. The remarkable performance uplift when using PGO is mostly due to a much quicker 525.x264_r (see figure 4). The reason is that with profile feedback GCC performs vectorization also at -O2, and this benchmark benefits a great deal from vectorization; in practice it really should be compiled with at least -O3. Nevertheless, several benchmarks which are derived from programs that are typically compiled with -O2 also benefit from these advanced modes of compilation, as can be seen in figure 5.

Figure 4: Runtime performance (bigger is better) of 525.x264_r built with GCC 11.2 and -⁠O2
Figure 5: Runtime performance (bigger is better) of the rest of the integer benchmarks built with GCC 11.2 and -⁠O2

Figure 6 shows their other important advantage, namely the reduction of the size of the binaries (measured without debug info), which can be significant with LTO or a combination of LTO and PGO. Note that the figure does not show that the size of benchmark 548.exchange2_r grew by 250% and 50% when built with PGO or with both PGO and LTO respectively. This looks huge, but the growth is from a particularly small base. Keep in mind that it is the only Fortran benchmark in the integer suite and that the size penalty is offset by a significant speed-up, making the trade-off reasonable, especially in the LTO+PGO case. For completeness, we show this result in figure 7.

Figure 6: Binary size (smaller is better) of individual integer benchmarks built with GCC 11.2 and -⁠O2
Figure 7: Binary size (smaller is better) of 548.exchange2_r built with GCC 11.2 and -⁠O2

The runtime benefits and binary size savings are also substantial when using the optimization level -Ofast and the option -march=native to allow the compiler to take full advantage of all instructions that the AMD EPYC 7543P Processor supports. Figure 8 shows the respective geometric means. The seemingly underwhelming overall performance of the compilation methods using PGO is caused by the fact that GCC 11 can optimize 548.exchange2_r much better than previous versions, but only without profile feedback. This happens because of shortcomings in how the compiler updates internal profile information during the new important transformations, which have been addressed for GCC 12. When evaluating these compilation methods on the integer SPEC benchmarks, it is therefore better to look at the individual scores as they are shown in figure 9. If we excluded 548.exchange2_r, the geometric mean of the LTO+PGO method would exceed 110% of the standard one. Even though optimization levels -O3 and -Ofast are permitted to be relaxed about the final binary size, PGO and especially LTO can bring it nicely down at these levels, too. Figure 10 depicts the relative binary sizes of all integer benchmarks.

Figure 8: Overall performance (bigger is better) of SPEC INTrate 2017 built with GCC 11.2 using -⁠Ofast and -⁠march=native
Figure 9: Runtime performance (bigger is better) of individual integer benchmarks built with GCC 11.2 using -⁠Ofast and -⁠march=native
Figure 10: Binary size (smaller is better) of SPEC INTrate 2017 built with GCC 11.2 using -⁠Ofast and -⁠march=native

Many of the SPEC CPU 2017 floating-point benchmarks measure how well a given system can optimize and execute a handful of number crunching loops, and they often come from performance sensitive programs written with the traditional compilation method in mind. Consequently, there are fewer cross-module dependencies, identifying hot paths is less crucial, and the effect of LTO and PGO on most of the floating-point benchmarks is often limited. Nevertheless, there are important cases when these modes of compilation also bring about significant performance increases. Figure 11 shows the effect of these methods on individual benchmarks when compiled at -Ofast and targeting the full ISA of the AMD EPYC 7543P Processor.

Figure 11: Runtime performance (bigger is better) of individual floating-point benchmarks built with GCC 11.2 using -⁠Ofast and -⁠march=native

The problematic case of 521.wrf_r is an unfortunate result of the SPEC scripts re-compiling parts of the benchmark when building the verifying utility and GCC 11 not being able to merge the measured profiles of the benchmark train run and the verification [4]. In practice this issue should be easy to avoid, and the upcoming GCC 12 has been enhanced to merge measured profiles even in this case. Excluding this problematic benchmark, the geometric mean of the rates using PGO+LTO is 5% higher. The binary size savings of PGO and LTO are sometimes even bigger than those achieved on the integer benchmarks, as can be seen in figure 12.

Figure 12: Binary size (smaller is better) of SPEC fprate 2017 built with GCC 11.2 using -⁠Ofast and -⁠march=native

7.2 GCC 11.2 compared to GCC 7.5

In previous sections we have recommended the use of GCC 11.2 from the Development Tools Module over the system compiler. Among other reasons, we did so because of its more powerful optimization pipeline and its support for newer CPUs. This section compares SPEC CPU 2017 results obtained with GCC 7.5, which corresponds to the system compiler in SUSE Linux Enterprise Server 15, and with GCC 11.2 on an AMD EPYC 7543P Processor, when all benchmarks are compiled with -Ofast and -march=native. Note that the latter option means that the two compilers differ in their CPU targets, because GCC 7.5 does not know the Zen 3 core. This in turn means that to a large extent the optimization benefits presented here arise because the newer compiler can take advantage of the full width of the vector paths in the processor whereas the old one uses only half of it. Nevertheless, be aware that simply using wider vectors everywhere often backfires. GCC has made substantial advancements over recent years to avoid such issues, both in its vectorizer and in other optimizers. It is therefore much better placed to use the extra vector width appropriately and to produce code which utilizes the processor better in general.

Figure 13: Overall performance (bigger is better) of SPEC INTrate 2017 built with GCC 7.5 and 11.2 (-⁠Ofast -⁠march=native)

Figure 13 captures the benefits of using the modern compiler with integer workloads in the form of relative improvements of the geometric mean of the whole SPEC INTrate 2017 suite. Figure 14 dives deeper and shows which particular benchmarks gained most in terms of performance. It was already mentioned that 525.x264_r especially benefits from vectorization, and therefore it is not surprising that it has improved a lot. 531.deepsjeng_r is faster chiefly because GCC 11 can emit better code for the count trailing zeros (CTZ) operation which it performs frequently. In the previous section we also claimed that GCC 11 can optimize 548.exchange2_r particularly well; it does so by specializing different invocations of the hottest recursive function, and this also clearly shows in the figure.

Figure 14: Runtime performance (bigger is better) of selected integer benchmarks built with GCC 7.5 and 11.2 (-⁠Ofast -⁠march=native)

Floating-point computations tend to particularly benefit from vectorization advancements and so it should be no surprise that the fprate benchmarks also improve substantially when compiled with GCC 11, as shown in figure 15. Figure 16 provides a detailed look at which benchmarks contributed most to the overall score difference. The 7% regression of 519.lbm_r is an issue we did not previously see on Zen 2 based processors. We do not yet know why it takes place but it only happens with LTO and not when using both PGO and LTO.

Figure 15: Overall performance (bigger is better) of SPEC fprate 2017 built with GCC 7.5 and 11.2 (-⁠Ofast -⁠march=native)
Figure 16: Runtime performance (bigger is better) of selected floating-point benchmarks built with GCC 7.5 and 11.2 (-⁠Ofast -⁠march=native)

7.3 Effects of -⁠ffast-math on floating-point performance

In Section 3, “Optimization levels and related options” we pointed out that, if you do not relax the semantics of floating-point math functions even though you do not need strict adherence to all respective IEEE and/or ISO rules, you are likely to be leaving some performance on the table. This section uses the SPEC fprate 2017 test suite to illustrate how much performance that might be.

We have built the benchmarking suite (without LTO or PGO) using optimization level -O3 and -march=native to target the native ISA of our AMD EPYC 7543P Processor, and we compared its runtime score against the suite built with these options plus -ffast-math. As you can see in figure 17, the geometric mean grew by over 8%, but a quick look at figure 18 will tell you that there are benchmarks whose scores improved twice as much and that the score of 503.bwaves_r grew by over 50%.

Figure 17: Overall performance (bigger is better) of SPEC fprate 2017 built with GCC 11.2 and -⁠O3 -⁠march=native, without and with -⁠ffast-math
Figure 18: Runtime performance (bigger is better) of selected floating-point benchmarks built with GCC 11.2 and -⁠O3 -⁠march=native, without and with -⁠ffast-math

7.4 Comparison with other compilers

The toolchain team at SUSE regularly uses the SPEC CPU 2017 suite to compare the optimization capabilities of GCC with other compilers, mainly Intel ICC and LLVM/Clang. In the final section of this case study we share how the Development Tools Module compiler stands compared to these competitors on SUSE Linux Enterprise Server 15 SP3. Before we start, we should emphasize that the comparison has been carried out by people who have a much better knowledge of GCC than of the other compilers and are therefore not unbiased. Also, keep in mind that everything we explained previously about how we carry out the measurements and patch the benchmarks applies to this section as well. On the other hand, the results often guide our work and therefore we strive to be accurate.

Even though ICC is not intended as a compiler for AMD processors, it is known for its high-level optimization capabilities, especially when it comes to vectorization. Thus we believe comparisons with it are useful even on AMD CPUs. We have therefore compiled the SPEC suite with ICC 2021.3.0 using the options -Ofast and -march=core-avx2 and compared the runtimes on the AMD EPYC 7543P Processor with binaries produced by GCC 11.2 with -Ofast and -march=native.

Figure 19: Overall performance (bigger is better) of SPEC INTrate 2017 built with ICC 2021.3.0 and GCC 11.2

Figure 19 depicts the respective geometric means of the rates of all the integer benchmarks when compiled with the two compilers, without and with LTO, relative to the performance of ICC without LTO. Unlike in the previous version of this comparison, GCC manages to achieve a better score in both the traditional compilation mode and with LTO. Figure 20 shows that while ICC achieves better scores on benchmark 505.mcf_r, and with LTO also on 531.deepsjeng_r and 548.exchange2_r, in all other cases GCC takes the lead.

Figure 20: Runtime performance (bigger is better) of individual integer benchmarks built with ICC 2021.3.0 and GCC 11.2

In the previous version of this document from last year we had to conclude that ICC achieved substantially better scores on the floating-point suite because it used its own math library, which had much faster implementations of some key functions, such as the complex exponential, than the glibc shipped in SUSE Linux Enterprise Server 15 SP2. Fortunately, SUSE Linux Enterprise Server 15 SP3 ships with the newer glibc 2.31, which comes with faster implementations of these functions and other improvements that allow us to report better scores relative to ICC this year. So much so that GCC wins by a small margin when it comes to the geometric mean of the rates, as you can see in figure 21.

Figure 21: Overall performance (bigger is better) of SPEC fprate 2017 built with ICC 2021.3.0 and GCC 11.2

It is important to look at the individual results too, because most of this difference comes from GCC optimizing 549.fotonik3d_r far better than ICC (see figure 22). We have not examined the difference in detail, but we have observed big wins for GCC on this benchmark even on an Intel Cascade Lake machine. Looking at the other benchmarks, the results are more even, and ICC achieves better scores on a number of them (see figure 23). Nevertheless, GCC is generally competitive against this high-performance compiler.

Figure 22: Runtime performance (bigger is better) of 549.fotonik3d_r built with ICC 2021.3.0 and GCC 11.2
Figure 23: Runtime performance (bigger is better) of the rest of the floating point benchmarks built with ICC 2021.3.0 and GCC 11.2

Comparisons with LLVM/Clang 13.0 are incomplete because although the newest release of this compiler suite already contains a Fortran compiler called flang, we could not compile all SPEC Fortran benchmarks with it. Moreover, the status of flang is not clear to us, and strace indicates that it internally invokes the system gfortran compiler rather than using the LLVM infrastructure.

Therefore we have excluded the benchmarks using Fortran from our comparison with LLVM/Clang 13.0. We have built the clang and clang++ compilers from sources obtained from the official git repository (tag llvmorg-13.0.0), used them to compile the SPEC CPU 2017 suite with -Ofast and -march=native, and compared the performance against the suites built with GCC 11.2 with the same options. When using Clang's LTO to compile SPEC, we selected the full variant.

Figure 24: Runtime performance (bigger is better) of C/C++ integer benchmarks built with Clang 13 and GCC 11.2

Figure 24 captures the integer benchmarks with significant performance differences. Clang scores one notable win with 531.deepsjeng_r, which we can mostly pin down to an if-conversion transformation that we are not certain is generally profitable, and it is slightly faster with 557.xz_r. On the other hand, GCC wins on the benchmark derived from itself and is remarkably faster on 500.perlbench_r, 505.mcf_r and 525.x264_r. If we omitted the one Fortran benchmark when calculating the overall geometric mean, GCC would win by 5% both with and without LTO against the equivalent result of LLVM/Clang.

The floating-point benchmark suite contains many more Fortran benchmarks, which makes the comparison more difficult. For more information, refer to figure 25, which captures the significant differences in the scores of the benchmarks written entirely in C/C++. Overall, LLVM/Clang has managed a slight lead: the geometric mean of the GCC rates of the seven benchmarks is 2% lower than that of LLVM/Clang. In the case of 544.nab_r, we have identified and addressed one of the causes, and the next release of GCC will narrow the gap by roughly a half.

Figure 25: Runtime performance (bigger is better) of C/C++ floating point benchmarks built with Clang 13 and GCC 11.2

8 Performance evaluation: Mozilla Firefox

Benchmarks such as SPEC CPU 2017 are very useful to gauge how compilers optimize a particular type of computation when it is embedded in a small- to medium-size project. Nevertheless, many real-world applications are much bigger, and this fact alone presents a significant challenge to optimizing compilers. When a transformation such as inlining can increase performance but also substantially grows the code, the compiler cannot simply apply it at every opportunity, because the final binary would become unacceptably big. To monitor how well GCC manages to optimize big real-world applications, we regularly build and benchmark the popular Mozilla Firefox browser [5]. This section summarizes our latest findings.

We use Mozilla's own Treeherder project to build Firefox with different compilers and options, and their Talos and Perfherder infrastructure to evaluate the resulting performance. The evaluation framework compares different binaries using several benchmarks which it runs enough times to eliminate measurement noise. The CPUs used in Mozilla Talos to benchmark x86_64 Linux builds are Intel E3-1585L v5. As shown further on, however, we measured similar results on an AMD Ryzen 7 5800X 8-Core Processor, although on simpler benchmarks and with far fewer data points.

Another important difference from traditional benchmarks is that most of Mozilla Firefox is implemented in a shared object file compiled as position independent code. This limits code generation in some situations, such as data segment relocations or global variable addressing [6]. Note that all Firefox binaries, regardless of the compiler used, were built with the following options in addition to any that are explicitly called out:

-fno-sized-deallocation -fno-aligned-new -fno-strict-aliasing -fPIC -fno-exceptions
-fno-rtti -fno-math-errno -fno-omit-frame-pointer

Unfortunately, Mozilla Firefox is one of the projects that have elected to use -⁠fno-strict-aliasing rather than fix the aliasing violations in their code, despite the performance implications of this choice.

To compare the two compilers, we needed to deal with the fact that the skia and gl libraries compile different code with GCC than with Clang, because with the latter they use Clang-specific vector extensions. We have simply used the following pre-compiled assembly files in place of the original sources:

./gfx/skia/skia/src/opts/SkOpts_ssse3.s
./gfx/skia/skia/src/opts/SkOpts.s
./gfx/skia/skia/src/opts/SkOpts_sse41.s
./gfx/skia/skia/src/opts/SkOpts_avx.s
./gfx/skia/skia/src/opts/SkOpts_hsw.s
./gfx/skia/skia/src/opts/SkOpts_sse42.s
./gfx/wr/swgl/src/gl.s

When we compare any two Firefox binaries, we start by looking at their size, excluding debug information sections. The size is often important by itself, for example when applications are updated over slower networks, but it is also a sign of how well the compiler can distinguish performance-sensitive and rarely executed pieces of a project. To evaluate runtime performance, in this document we selected four benchmarks. The chief benchmark among them is tp5o [7], which measures the time it takes Firefox to load the tp5 Web page test set, a collection of 151 pages originally picked from a list of the 500 most popular ones in 2011. We also look at the tp5o responsiveness test [8], which measures how responsive Firefox is while carrying out a non-trivial workload. The last Talos test we have chosen to focus on is called perf reftest singletons [9]. It is a micro-benchmark that loads simple HTML pages and then measures basic manipulation of their elements, such as adding a row to a table. This benchmark is itself part of the train run in PGO builds, and thus the PGO binaries should be well trained for it. Finally, we have used Speedometer 2.0 [10] to cross-check selected results from Talos on an AMD Ryzen 7 5800X 8-Core Processor. Speedometer is a rather simple benchmark which simulates user actions for adding, completing, and removing to-do items using DOM APIs in different ways. Speedometer is also part of the profile train run.

8.1 Effects of -⁠O3 compared to -⁠O2 and of LTO and PGO

Mozilla Firefox is a large application, and code size should therefore definitely play a role when deciding how to compile it. On the other hand, a Web browser is also likely to be a substantial part of a typical desktop workload, so gains in performance can easily justify binary size increases. As a consequence, Firefox is typically built with -⁠O3. Figure 26 depicts the sizes of the Firefox libxul library, which contains the bulk of the browser, when built with GCC 11 using the Mozilla Treeherder infrastructure with the optimization levels and modes most discussed in this document. Again, you can see that LTO can reduce the code size to an extent that more than offsets the difference between -⁠O3 and -⁠O2. Note that, because a big portion of Firefox is written in Rust and the whole-program analysis is limited to the parts written in C++, the LTO benefits are smaller than in the typical case, in terms of both size and performance. Work on the Rust GCC front-end has started only recently, but we hope that it will eventually overcome this limitation. Nevertheless, as demonstrated throughout this case study, LTO combined with PGO is by far the best option, not only in the code size comparison but also in every other measurement.

Figure 26: Code size (smaller is better) of Firefox binaries built with GCC 11.2 with different options

When employing neither LTO nor PGO, the performance difference between -⁠O2 and -⁠O3 is visible but modest in the tp5o benchmark results, somewhat more pronounced in the singletons benchmark, and barely visible in the responsiveness measure. Our data indicate that the use of LTO regresses the responsiveness benchmark by 2-3% at both -⁠O2 and -⁠O3. The performance drop is close to the noise level and therefore difficult to investigate, but we believe it occurs because the code growth limits, when applied to the entire binary, prevent useful inlining that is allowed when the limits are applied to individual compilation units.

LTO improves performance slightly in all the other benchmarks. The gain is small, but it should be assessed together with the code size reduction LTO brings about. The real speed-up comes only when PGO is added to the mix, leading to a performance gain of 9% in the responsiveness test and over 17% in all other benchmarks. This observation holds both for the data measured using the Talos and Perfherder systems (figure 27) and for the Speedometer results we obtained manually on an AMD Ryzen 7 5800X 8-Core Processor (figure 28). This is especially remarkable when you consider that the binary is more than 20% smaller than a simple -⁠O2 build.
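As a reminder of what such a combined build involves, the following is a minimal, generic sketch of a GCC LTO and PGO build cycle. The file names are placeholders, and Firefox itself drives the equivalent steps through its own build system rather than by hand:

# Step 1: instrumented build; -fprofile-generate inserts profiling counters, -flto enables LTO
gcc-11 -O3 -flto -fprofile-generate app.c -o app-instrumented
# Step 2: training run; executing a representative workload writes *.gcda profile files
./app-instrumented < training-workload
# Step 3: optimized rebuild; -fprofile-use consumes the collected profile
gcc-11 -O3 -flto -fprofile-use app.c -o app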

Figure 27: Runtime performance (bigger is better) of Firefox built with GCC 11.2 with different options, running on Mozilla Talos infrastructure
Figure 28: Runtime performance (bigger is better) of Firefox built with GCC 11.2 with different options, running Speedometer 2.0 on an AMD Ryzen 7 5800X 8-Core Processor

8.2 GCC 11.2 compared to GCC 7.5

Section 7.2 demonstrates that GCC 11 produces much faster code when targeting modern CPUs such as the AMD EPYC 7003 Series Processors, often because it can take advantage of the vector instructions of the new hardware. This section aims to show that GCC 11 also produces faster code when emitting instructions for any x86_64 system and running on processors that are not as new. We have compared Firefox binaries built with GCC 7.5 and 11.2 using the -⁠O2 optimization level and the classic compilation method, that is, without LTO or PGO. Unfortunately, our attempts to build a modern Firefox with LTO or PGO and the old compiler have failed. The sizes of the binaries produced by the two compilers are very similar, but the one created with GCC 11.2 has always performed noticeably better. In the tp5o responsiveness benchmark, the simple -⁠O2 build produced by GCC 11.2 was 10% faster (see figure 29).
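For reference, the comparison essentially boils down to building the same sources with the system compiler and with the newer one. The sketch below assumes the gcc11 package providing the gcc-11 executable is installed and uses a placeholder source file:

# System compiler (GCC 7.5)
gcc -O2 -c example.c -o example-gcc7.o
# GCC 11.2 from the Development Tools Module
gcc-11 -O2 -c example.c -o example-gcc11.o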

Figure 29: Runtime performance (bigger is better) of Firefox built with GCC 7.5 and 11.2, running on Mozilla Talos infrastructure

8.3 Comparison with Clang 13

To evaluate how GCC is doing compared to other compilers, we regularly compare the Firefox binaries produced by GCC with those emitted by LLVM/Clang, which is currently the compiler preferred by the team at Mozilla. Before we proceed, we should emphasize that the authors of this document are not nearly as familiar with LLVM/Clang as they are with GCC, and that they are not unbiased. On the other hand, our findings in such comparisons guide our own future work, and therefore we strive to be accurate.

Because the notion of optimization levels differs somewhat between the two compilers, we have focused on evaluating how they build Firefox using -⁠O3 in the traditional way, when using LTO (in Clang's case its thin variant), and when using both PGO and LTO. Possibly the most striking differences are in the code sizes of the resulting binaries (see figure 30). When using plain -⁠O3, GCC produces a 9% larger binary than Clang. With LTO, GCC manages to shrink the code size enough to more than undo this difference, whereas Clang uses the extra cross-module inlining opportunities to grow the code by 12%. The difference is even more pronounced with PGO added on top of LTO, which enables Clang to produce a binary that is 5% smaller than when using neither of them, while GCC creates the smallest binary of all: 22% smaller than Clang's with the same options and 24% smaller than its own plain -⁠O3 build.

Figure 30: Code size (smaller is better) of Firefox binaries built with GCC 11.2 and Clang 13

Runtime comparisons measured on Talos can be found in figure 31. In the tp5o benchmark, GCC with the help of both PGO and LTO manages to produce code that is 10% faster; with the other compilation methods, the performance of the two compilers was comparable. In the responsiveness measurement, it was Clang that was 6% faster when using both PGO and LTO. As in the previous case, the other respective compilation methods of the two compilers performed similarly, except for the LTO regression discussed in section 8.1. In the singletons benchmark, GCC was always distinctly faster.

Figure 31: Runtime performance (bigger is better) of Firefox with GCC 11.2 and Clang 13, running on Mozilla Talos infrastructure

Running Speedometer 2.0 on an AMD Ryzen 7 5800X 8-Core Processor (see figure 32) did not show any meaningful difference in the performance of the code produced by the two compilers: for each compilation method, the worst runtime measured for one compiler was always worse than the best one of the other, so the measurement ranges overlapped. As we have emphasized earlier though, with the most powerful method, GCC can achieve this performance while producing a binary that is 22% smaller than Clang's.

Figure 32: Runtime performance (bigger is better) of Firefox with GCC 11.2 and Clang 13, running Speedometer 2.0 on an AMD Ryzen 7 5800X 8-Core Processor

9 Legal notice

Copyright ©2006-2025 SUSE LLC and contributors. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled GNU Free Documentation License.

SUSE, the SUSE logo and YaST are registered trademarks of SUSE LLC in the United States and other countries. For SUSE trademarks, see http://www.suse.com/company/legal/. Linux is a registered trademark of Linus Torvalds. All other names or trademarks mentioned in this document may be trademarks or registered trademarks of their respective owners.

Documents published as part of the SUSE Best Practices series have been contributed voluntarily by SUSE employees and third parties. They are meant to serve as examples of how particular actions can be performed. They have been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. SUSE cannot verify that actions described in these documents do what is claimed or whether actions described have unintended consequences. SUSE LLC, its affiliates, the authors, and the translators may not be held liable for possible errors or the consequences thereof.

Below we draw your attention to the license under which the articles are published.
